1. 08 10月, 2018 6 次提交
  2. 27 9月, 2018 1 次提交
    • G
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang 提交于
      After write SSD completed, bcache schedules journal_write work to
      system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM
      flag. system_wq is also a bound wq, and there may be no idle kworker on
      current processor. Creating a new kworker may unfortunately need to
      reclaim memory first, by shrinking cache and slab used by vfs, which
      depends on bcache device. That's a deadlock.
      
      This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM
      flag. It's rescuer thread will work to avoid the deadlock.
      Signed-off-by: NGuoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0f843e65
  3. 12 8月, 2018 9 次提交
  4. 09 8月, 2018 3 次提交
    • C
      bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li 提交于
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce performance regression because multiple faster
      writeback threads of the idle bcache devices will compete the btree level
      locks with the bcache device who have I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writebac when
      all bcache devices attached on the cache set are idle. And if one of the
      bcache devices has new I/O request coming, minimized all writeback
      throughput immediately and let PI controller __update_writeback_rate()
      to decide the upcoming writeback rate for each bcache device.
      
      Also when all bcache devices are idle, limited wrieback rate to a small
      number is wast of thoughput, especially when backing devices are slower
      non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
      rate for each backing device if the whole cache set is idle. A faster
      writeback rate in idle time means new I/Os may have more available space
      for dirty data, and people may observe a better write performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: NColy Li <colyli@suse.de>
      Tested-by: NKai Krakow <kai@kaishome.de>
      Tested-by: NStefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ea8c5356
    • C
      bcache: add a comment in super.c · e57fd746
      Coly Li 提交于
      This patch adds a line of code comment in super.c:register_bdev(), to
      make code to be more comprehensible.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e57fd746
    • C
      bcache: do not check return value of debugfs_create_dir() · 78ac2107
      Coly Li 提交于
      Greg KH suggests that normal code should not care about debugfs. Therefore
      no matter successful or failed of debugfs_create_dir() execution, it is
      unncessary to check its return value.
      
      There are two functions called debugfs_create_dir() and check the return
      value, which are bch_debug_init() and closure_debug_init(). This patch
      changes these two functions from int to void type, and ignore return values
      of debugfs_create_dir().
      
      This patch does not fix exact bug, just makes things work as they should.
      Signed-off-by: NColy Li <colyli@suse.de>
      Suggested-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      78ac2107
  5. 27 7月, 2018 5 次提交
  6. 13 6月, 2018 2 次提交
    • K
      treewide: Use array_size() in vzalloc() · fad953ce
      Kees Cook 提交于
      The vzalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vzalloc(a * b)
      
      with:
              vzalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vzalloc(a * b * c)
      
      with:
      
              vzalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vzalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vzalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vzalloc(C1 * C2 * C3, ...)
      |
        vzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vzalloc(C1 * C2, ...)
      |
        vzalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      fad953ce
    • K
      treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook 提交于
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a * b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kzalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6396bb22
  7. 31 5月, 2018 1 次提交
  8. 29 5月, 2018 2 次提交
    • A
      bcache: Move couple of string arrays to sysfs.c · 04cbc211
      Andy Shevchenko 提交于
      There is couple of string arrays that are used exclusively in sysfs.c.
      Move it to there and make them static.
      
      Besides above, it will allow further clean up.
      
      No functional change intended.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      04cbc211
    • C
      bcache: stop bcache device when backing device is offline · 0f0709e6
      Coly Li 提交于
      Currently bcache does not handle backing device failure, if backing
      device is offline and disconnected from system, its bcache device can still
      be accessible. If the bcache device is in writeback mode, I/O requests even
      can success if the requests hit on cache device. That is to say, when and
      how bcache handles offline backing device is undefined.
      
      This patch tries to handle backing device offline in a rather simple way,
      - Add cached_dev->status_update_thread kernel thread to update backing
        device status in every 1 second.
      - Add cached_dev->offline_seconds to record how many seconds the backing
        device is observed to be offline. If the backing device is offline for
        BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and
        call bcache_device_stop() to stop the bache device which linked to the
        offline backing device.
      
      Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds,
      its bcache device will be removed, then user space application writing on
      it will get error immediately, and handler the device failure in time.
      
      This patch is quite simple, does not handle more complicated situations.
      Once the bcache device is stopped, users need to recovery the backing
      device, register and attach it manually.
      
      Changelog:
      v3: call wait_for_kthread_stop() before exits kernel thread.
      v2: remove "bcache: " prefix when calling pr_warn().
      v1: initial version.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0f0709e6
  9. 03 5月, 2018 4 次提交
    • C
      bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set · 09a44ca2
      Coly Li 提交于
      It is possible that multiple I/O requests hits on failed cache device or
      backing device, therefore it is quite common that CACHE_SET_IO_DISABLE is
      set already when a task tries to set the bit from bch_cache_set_error().
      Currently the message "CACHE_SET_IO_DISABLE already set" is printed by
      pr_warn(), which might mislead users to think a serious fault happens in
      source code.
      
      This patch uses pr_info() to print the information in such situation,
      avoid extra worries. This information is helpful to understand bcache
      behavior in cache device failures, so I still keep them in source code.
      
      Fixes: 771f393e ("bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      09a44ca2
    • C
      bcache: set dc->io_disable to true in conditional_stop_bcache_device() · 4fd8e138
      Coly Li 提交于
      Commit 7e027ca4 ("bcache: add stop_when_cache_set_failed option to
      backing device") adds stop_when_cache_set_failed option and stops bcache
      device if stop_when_cache_set_failed is auto and there is dirty data on
      broken cache device. There might exists a small time gap that the cache
      set is released and set to NULL but bcache device is not released yet
      (because they are released in parallel). During this time gap, dc->c is
      NULL so CACHE_SET_IO_DISABLE won't be checked, and dc->io_disable is still
      false, so new coming I/O requests will be accepted and directly go into
      backing device as no cache set attached to. If there is dirty data on
      cache device, this behavior may introduce potential inconsistent data.
      
      This patch sets dc->io_disable to true before calling bcache_device_stop()
      to make sure the backing device will reject new coming I/O request as
      well, so even in the small time gap no I/O will directly go into backing
      device to corrupt data consistency.
      
      Fixes: 7e027ca4 ("bcache: add stop_when_cache_set_failed option to backing device")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4fd8e138
    • C
      bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error() · 6147305c
      Coly Li 提交于
      Commit c7b7bd07 ("bcache: add io_disable to struct cached_dev") tries
      to stop bcache device by calling bcache_device_stop() when too many I/O
      errors happened on backing device. But if there is internal I/O happening
      on cache device (writeback scan, garbage collection, etc), a regular I/O
      request triggers the internal I/Os may still holds a refcount of dc->count,
      and the refcount may only be dropped after the internal I/O stopped.
      
      By this patch, bch_cached_dev_error() will check if the backing device is
      attached to a cache set, if yes that CACHE_SET_IO_DISABLE will be set to
      flags of this cache set. Then internal I/Os on cache device will be
      rejected and stopped immediately, and the bcache device can be stopped.
      
      For people who are not familiar with the interesting refcount dependance,
      let me explain a bit more how the fix works. Example the writeback thread
      will scan cache device for dirty data writeback purpose. Before it stopps,
      it holds a refcount of dc->count. When CACHE_SET_IO_DISABLE bit is set,
      the internal I/O will stopped and the while-loop in bch_writeback_thread()
      quits and calls cached_dev_put() to drop dc->count. If this is the last
      refcount to drop, then cached_dev_detach_finish() will be called. In this
      call back function, in turn closure_put(dc->disk.cl) is called to drop a
      refcount of closure dc->disk.cl. If this is the last refcount of this
      closure to drop, then cached_dev_flush() will be called. Then the cached
      device is freed. So if CACHE_SET_IO_DISABLE is not set, the bache device
      can not be stopped until all inernal cache device I/O stopped. For large
      size cache device, and writeback thread competes locks with gc thread,
      there might be a quite long time to wait.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6147305c
    • C
      bcache: store disk name in struct cache and struct cached_dev · 6e916a7e
      Coly Li 提交于
      Current code uses bdevname() or bio_devname() to reference gendisk
      disk name when bcache needs to display the disk names in kernel message.
      It was safe before bcache device failure handling patch set merged in,
      because when devices are failed, there was deadlock to prevent bcache
      printing error messages with gendisk disk name. But after the failure
      handling patch set merged, the deadlock is fixed, so it is possible
      that the gendisk structure bdev->hd_disk is released when bdevname() is
      called to reference bdev->bd_disk->disk_name[]. This is why I receive
      bug report of NULL pointers deference panic.
      
      This patch stores gendisk disk name in a buffer inside struct cache and
      struct cached_dev, then print out the offline device name won't reference
      bdev->hd_disk anymore. And this patch also avoids extra function calls
      of bdevname() and bio_devnmae().
      
      Changelog:
      v3, add Reviewed-by from Hannes.
      v2, call bdevname() earlier in register_bdev()
      v1, first version with segguestion from Junhui Tang.
      
      Fixes: c7b7bd07 ("bcache: add io_disable to struct cached_dev")
      Fixes: 5138ac67 ("bcache: fix misleading error message in bch_count_io_errors()")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6e916a7e
  10. 19 3月, 2018 7 次提交
    • B
      bcache: Fix a compiler warning in bcache_device_init() · 5f2b18ec
      Bart Van Assche 提交于
      Avoid that building with W=1 triggers the following compiler warning:
      
      drivers/md/bcache/super.c:776:20: warning: comparison is always false due to limited range of data type [-Wtype-limits]
            d->nr_stripes > SIZE_MAX / sizeof(atomic_t)) {
                          ^
      Reviewed-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5f2b18ec
    • C
      bcache: add io_disable to struct cached_dev · c7b7bd07
      Coly Li 提交于
      If a bcache device is configured to writeback mode, current code does not
      handle write I/O errors on backing devices properly.
      
      In writeback mode, write request is written to cache device, and
      latter being flushed to backing device. If I/O failed when writing from
      cache device to the backing device, bcache code just ignores the error and
      upper layer code is NOT noticed that the backing device is broken.
      
      This patch tries to handle backing device failure like how the cache device
      failure is handled,
      - Add a error counter 'io_errors' and error limit 'error_limit' in struct
        cached_dev. Add another io_disable to struct cached_dev to disable I/Os
        on the problematic backing device.
      - When I/O error happens on backing device, increase io_errors counter. And
        if io_errors reaches error_limit, set cache_dev->io_disable to true, and
        stop the bcache device.
      
      The result is, if backing device is broken of disconnected, and I/O errors
      reach its error limit, backing device will be disabled and the associated
      bcache device will be removed from system.
      
      Changelog:
      v2: remove "bcache: " prefix in pr_error(), and use correct name string to
          print out bcache device gendisk name.
      v1: indeed this is new added in v2 patch set.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c7b7bd07
    • C
      bcache: add backing_request_endio() for bi_end_io · 27a40ab9
      Coly Li 提交于
      In order to catch I/O error of backing device, a separate bi_end_io
      call back is required. Then a per backing device counter can record I/O
      errors number and retire the backing device if the counter reaches a
      per backing device I/O error limit.
      
      This patch adds backing_request_endio() to bcache backing device I/O code
      path, this is a preparation for further complicated backing device failure
      handling. So far there is no real code logic change, I make this change a
      separate patch to make sure it is stable and reliable for further work.
      
      Changelog:
      v2: Fix code comments typo, remove a redundant bch_writeback_add() line
          added in v4 patch set.
      v1: indeed this is new added in this patch set.
      
      [mlyle: truncated commit subject]
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27a40ab9
    • C
      bcache: move closure debug file into debug directory · df2b9431
      Chengguang Xu 提交于
      In current code closure debug file is outside of debug directory
      and when unloading module there is lack of removing operation
      for closure debug file, so it will cause creating error when trying
      to reload  module.
      
      This patch move closure debug file into "bcache" debug direcory
      so that the file can get deleted properly.
      Signed-off-by: NChengguang Xu <cgxu519@gmx.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Reviewed-by: NTang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      df2b9431
    • C
      bcache: add stop_when_cache_set_failed option to backing device · 7e027ca4
      Coly Li 提交于
      When there are too many I/O errors on cache device, current bcache code
      will retire the whole cache set, and detach all bcache devices. But the
      detached bcache devices are not stopped, which is problematic when bcache
      is in writeback mode.
      
      If the retired cache set has dirty data of backing devices, continue
      writing to bcache device will write to backing device directly. If the
      LBA of write request has a dirty version cached on cache device, next time
      when the cache device is re-registered and backing device re-attached to
      it again, the stale dirty data on cache device will be written to backing
      device, and overwrite latest directly written data. This situation causes
      a quite data corruption.
      
      But we cannot simply stop all attached bcache devices when the cache set is
      broken or disconnected. For example, use bcache to accelerate performance
      of an email service. In such workload, if cache device is broken but no
      dirty data lost, keep the bcache device alive and permit email service
      continue to access user data might be a better solution for the cache
      device failure.
      
      Nix <nix@esperi.org.uk> points out the issue and provides the above example
      to explain why it might be necessary to not stop bcache device for broken
      cache device. Pavel Goran <via-bcache@pvgoran.name> provides a brilliant
      suggestion to provide "always" and "auto" options to per-cached device
      sysfs file stop_when_cache_set_failed. If cache set is retiring and the
      backing device has no dirty data on cache, it should be safe to keep the
      bcache device alive. In this case, if stop_when_cache_set_failed is set to
      "auto", the device failure handling code will not stop this bcache device
      and permit application to access the backing device with a unattached
      bcache device.
      
      Changelog:
      [mlyle: edited to not break string constants across lines]
      v3: fix typos pointed out by Nix.
      v2: change option values of stop_when_cache_set_failed from 1/0 to
          "auto"/"always".
      v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
          (always stop).
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Signed-off-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Pavel Goran <via-bcache@pvgoran.name>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7e027ca4
    • C
      bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags · 771f393e
      Coly Li 提交于
      When too many I/Os failed on cache device, bch_cache_set_error() is called
      in the error handling code path to retire whole problematic cache set. If
      new I/O requests continue to come and take refcount dc->count, the cache
      set won't be retired immediately, this is a problem.
      
      Further more, there are several kernel thread and self-armed kernel work
      may still running after bch_cache_set_error() is called. It needs to wait
      quite a while for them to stop, or they won't stop at all. They also
      prevent the cache set from being retired.
      
      The solution in this patch is, to add per cache set flag to disable I/O
      request on this cache and all attached backing devices. Then new coming I/O
      requests can be rejected in *_make_request() before taking refcount, kernel
      threads and self-armed kernel worker can stop very fast when flags bit
      CACHE_SET_IO_DISABLE is set.
      
      Because bcache also do internal I/Os for writeback, garbage collection,
      bucket allocation, journaling, this kind of I/O should be disabled after
      bch_cache_set_error() is called. So closure_bio_submit() is modified to
      check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
      closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and
      return, generic_make_request() won't be called.
      
      A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit
      from cache_set->flags, to disable or enable cache set I/O for debugging. It
      is helpful to trigger more corner case issues for failed cache device.
      
      Changelog
      v4, add wait_for_kthread_stop(), and call it before exits writeback and gc
          kernel threads.
      v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
          remove "bcache: " prefix when printing out kernel message.
      v2, more changes by previous review,
      - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
      - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
        is reported and inspired from origal patch of Pavel Vazharov.
      v1, initial version.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Pavel Vazharov <freakpv@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      771f393e
    • C
      bcache: stop dc->writeback_rate_update properly · 3fd47bfe
      Coly Li 提交于
      struct delayed_work writeback_rate_update in struct cache_dev is a delayed
      worker to call function update_writeback_rate() in period (the interval is
      defined by dc->writeback_rate_update_seconds).
      
      When a metadate I/O error happens on cache device, bcache error handling
      routine bch_cache_set_error() will call bch_cache_set_unregister() to
      retire whole cache set. On the unregister code path, this delayed work is
      stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update).
      
      dc->writeback_rate_update is a special delayed work from others in bcache.
      In its routine update_writeback_rate(), this delayed work is re-armed
      itself. That means when cancel_delayed_work_sync() returns, this delayed
      work can still be executed after several seconds defined by
      dc->writeback_rate_update_seconds.
      
      The problem is, after cancel_delayed_work_sync() returns, the cache set
      unregister code path will continue and release memory of struct cache set.
      Then the delayed work is scheduled to run, __update_writeback_rate()
      will reference the already released cache_set memory, and trigger a NULL
      pointer deference fault.
      
      This patch introduces two more bcache device flags,
      - BCACHE_DEV_WB_RUNNING
        bit set:  bcache device is in writeback mode and running, it is OK for
                  dc->writeback_rate_update to re-arm itself.
        bit clear:bcache device is trying to stop dc->writeback_rate_update,
                  this delayed work should not re-arm itself and quit.
      - BCACHE_DEV_RATE_DW_RUNNING
        bit set:  routine update_writeback_rate() is executing.
        bit clear: routine update_writeback_rate() quits.
      
      This patch also adds a function cancel_writeback_rate_update_dwork() to
      wait for dc->writeback_rate_update quits before cancel it by calling
      cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected
      quit dc->writeback_rate_update, after time_out seconds this function will
      give up and continue to call cancel_delayed_work_sync().
      
      And here I explain how this patch stops self re-armed delayed work properly
      with the above stuffs.
      
      update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
      and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
      cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING.
      
      Before calling cancel_delayed_work_sync() wait utill flag
      BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
      cancel_delayed_work_sync(), dc->writeback_rate_update must be already re-
      armed, or quite by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases
      delayed work routine update_writeback_rate() won't be executed after
      cancel_delayed_work_sync() returns.
      
      Inside update_writeback_rate() before calling schedule_delayed_work(), flag
      BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means
      someone is about to stop the delayed work. Because flag
      BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync()
      has to wait for this flag to be cleared, we don't need to worry about race
      condition here.
      
      If update_writeback_rate() is scheduled to run after checking
      BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
      in cancel_writeback_rate_update_dwork(), it is also safe. Because at this
      moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned
      previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear
      and quit immediately.
      
      Because there are more dependences inside update_writeback_rate() to struct
      cache_set memory, dc->writeback_rate_update is not a simple self re-arm
      delayed work. After trying many different methods (e.g. hold dc->count, or
      use locks), this is the only way I can find which works to properly stop
      dc->writeback_rate_update delayed work.
      
      Changelog:
      v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
          to bit index, for test_bit().
      v2: Try to fix the race issue which is pointed out by Junhui.
      v1: The initial version for review
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NJunhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: NMichael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3fd47bfe