1. 31 May, 2019 2 commits
    • C
      bcache: return error immediately in bch_journal_replay() · 29b166da
      Coly Li committed
      [ Upstream commit 68d10e6979a3b59e3cd2e90bfcafed79c4cf180a ]
      
      When a failure happens inside bch_journal_replay(), calling
      cache_set_err_on() and handling the failure in an async way is not a
      good idea. After bch_journal_replay() returns, the registering code
      continues to execute the following steps while the unregistering code
      triggered by cache_set_err_on() runs at the same time. First, it is
      unnecessary to handle the failure and unregister the cache set in an
      async way; second, there is a potential race condition between the
      register and unregister code for the same cache set.
      
      So in this patch, if a failure happens in bch_journal_replay(), we don't
      call cache_set_err_on(); we just print the same error message to the
      kernel message buffer and return -EIO immediately to the caller. The
      caller can then detect such a failure and handle it in a synchronized way.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • S
      bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC... · 8034a6b8
      Shenghui Wang committed
      bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
      
      [ Upstream commit 95f18c9d1310730d075499a75aaf13bcd60405a7 ]
      
      In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
      to collect journal_replay(s) and filled by bch_journal_read().
      
      If all goes well, bch_journal_replay() will release the list of
      journal_replay(s) at the end of the branch.
      
      If something goes wrong, code flow will jump to the label "err:" and leave
      the list unreleased.
      
      This patch releases the list of journal_replay(s) when an error is
      detected.
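The cleanup pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_replay`, `toy_push()`, `toy_release_list()`, `toy_run()` and the `toy_live_entries` counter are hypothetical names, not bcache code, and the real list uses the kernel's `list_head` rather than a hand-rolled singly linked list.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct journal_replay; not the kernel type. */
struct toy_replay {
    struct toy_replay *next;
};

int toy_live_entries;   /* entries currently allocated, for demonstration only */

/* Allocate one entry and put it at the head of the list. */
int toy_push(struct toy_replay **head)
{
    struct toy_replay *r = malloc(sizeof(*r));

    if (!r)
        return -1;
    r->next = *head;
    *head = r;
    toy_live_entries++;
    return 0;
}

/* Free every entry that is still on the list. */
void toy_release_list(struct toy_replay **head)
{
    while (*head) {
        struct toy_replay *r = *head;

        *head = r->next;
        free(r);
        toy_live_entries--;
    }
}

/* Returns 0 on success, -1 if a later step fails (simulated by `fail`). */
int toy_run(int fail)
{
    struct toy_replay *journal = NULL;
    int i;

    for (i = 0; i < 3; i++)
        if (toy_push(&journal))
            goto err;

    if (fail)
        goto err;

    toy_release_list(&journal);     /* success path: list is consumed and freed */
    return 0;
err:
    toy_release_list(&journal);     /* error path: release whatever was collected */
    return -1;
}
```

Placing the release right after the error label means every failing step shares one cleanup site, which is the simplification the v2 note refers to.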
      
      v1 -> v2:
      * Move the release code to the location after label 'err:' to
        simplify the change.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 22 May, 2019 2 commits
    • C
      bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() · 88681649
      Coly Li committed
      commit 1bee2addc0c8470c8aaa65ef0599eeae96dd88bc upstream.
      
      In journal_reclaim(), ja->cur_idx of each cache is updated to
      reclaim available journal buckets. Variable 'int n' counts how many
      caches are successfully reclaimed, then n is stored in c->journal.key
      by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
      loop writes the jset data onto each cache.
      
      The problem is, if all journal buckets on each cache are full, the
      following code in journal_reclaim(),
      
      529 for_each_cache(ca, c, iter) {
      530       struct journal_device *ja = &ca->journal;
      531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
      532
      533       /* No space available on this device */
      534       if (next == ja->discard_idx)
      535               continue;
      536
      537       ja->cur_idx = next;
      538       k->ptr[n++] = MAKE_PTR(0,
      539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
      540                         ca->sb.nr_this_dev);
      541 }
      542
      543 bkey_init(k);
      544 SET_KEY_PTRS(k, n);
      
      If there is no available bucket to reclaim, the if() condition at line
      534 is always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
      sets the KEY_PTRS field of c->journal.key to 0.
      
      Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
      journal_write_unlocked() the journal data is written in following loop,
      
      649	for (i = 0; i < KEY_PTRS(k); i++) {
      650-671		submit journal data to cache device
      672	}
      
      If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal data
      won't be written to the cache device here. If the system crashes or
      reboots before the bkeys of the lost journal entries are written into
      btree nodes, data corruption will be reported during bcache reload after
      rebooting the system.
      
      In fact there is only one cache in a cache set, so there is no need to
      set the KEY_PTRS field in journal_reclaim() at all. But in order to keep
      the for_each_cache() logic consistent for now, this patch fixes the above
      problem by not setting KEY_PTRS of the journal key to 0 if there is no
      bucket available to reclaim.
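A minimal userspace sketch of the guard described above, assuming simplified stand-ins for the kernel structures: `toy_cache`, `toy_key`, `toy_journal_reclaim()` and `TOY_MAX_CACHES` are hypothetical names, and `nr_ptrs` merely plays the role of the KEY_PTRS field. It is not the upstream diff.

```c
#include <stddef.h>

#define TOY_MAX_CACHES 4

/* Hypothetical stand-in for the per-cache journal state. */
struct toy_cache {
    unsigned int cur_idx;
    unsigned int discard_idx;
    unsigned int nbuckets;
};

/* Hypothetical stand-in for the journal key. */
struct toy_key {
    unsigned int nr_ptrs;               /* plays the role of KEY_PTRS */
    unsigned int ptr[TOY_MAX_CACHES];
};

/* Returns how many caches had a journal bucket reclaimed. */
unsigned int toy_journal_reclaim(struct toy_cache *caches, size_t ncaches,
                                 struct toy_key *k)
{
    unsigned int n = 0;
    size_t i;

    for (i = 0; i < ncaches && i < TOY_MAX_CACHES; i++) {
        struct toy_cache *ca = &caches[i];
        unsigned int next = (ca->cur_idx + 1) % ca->nbuckets;

        /* No space available on this device */
        if (next == ca->discard_idx)
            continue;

        ca->cur_idx = next;
        k->ptr[n++] = next;
    }

    /* Only touch the pointer count when something was reclaimed. */
    if (n)
        k->nr_ptrs = n;

    return n;
}
```

With the guard in place, a reclaim pass that finds every journal bucket full leaves the previous pointer count intact, so a later write loop bounded by that count still submits the journal data.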
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • L
      bcache: fix a race between cache register and cacheset unregister · ecfc882f
      Liang Chen committed
      commit a4b732a248d12cbdb46999daf0bf288c011335eb upstream.
      
      There is a race between cache device register and cache set unregister.
      For an already registered cache device, register_bcache will call
      bch_is_open to iterate through all cachesets and check every cache
      there. The race occurs if cache_set_free executes at the same time and
      clears the caches right before ca is dereferenced in bch_is_open_cache.
      To close the race, let's make sure the clean up work is protected by
      the bch_register_lock as well.
      
      This issue can be reproduced as follows,
      while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
      while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &
      
      and results in the following oops,
      
      [  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
      [  +0.000457] #PF error: [normal kernel read fault]
      [  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
      [  +0.000388] Oops: 0000 [#1] SMP PTI
      [  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
      [  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
      [  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
      [  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
      [  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
      [  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
      [  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
      [  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
      [  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
      [  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
      [  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
      [  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
      [  +0.000837] Call Trace:
      [  +0.000682]  ? _cond_resched+0x10/0x20
      [  +0.000691]  ? __kmalloc+0x131/0x1b0
      [  +0.000710]  kernfs_fop_write+0xfa/0x170
      [  +0.000733]  __vfs_write+0x2e/0x190
      [  +0.000688]  ? inode_security+0x10/0x30
      [  +0.000698]  ? selinux_file_permission+0xd2/0x120
      [  +0.000752]  ? security_file_permission+0x2b/0x100
      [  +0.000753]  vfs_write+0xa8/0x1a0
      [  +0.000676]  ksys_write+0x4d/0xb0
      [  +0.000699]  do_syscall_64+0x3a/0xf0
      [  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 06 April, 2019 5 commits
    • C
      bcache: fix potential div-zero error of writeback_rate_p_term_inverse · e7d26616
      Coly Li committed
      [ Upstream commit 5b5fd3c94eef69dcfaa8648198e54c92e5687d6d ]
      
      Current code already uses d_strtoul_nonzero() to convert input string
      to an unsigned integer, to make sure writeback_rate_p_term_inverse
      won't be zero value. But overflow may happen when converting input
      string to an unsigned integer value by d_strtoul_nonzero(), then
      dc->writeback_rate_p_term_inverse can still be set to 0 even if the
      sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1).
      
      If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
      div-zero error in the following code from __update_writeback_rate(),
      	int64_t proportional_scaled =
      		div_s64(error, dc->writeback_rate_p_term_inverse);
      
      This patch replaces d_strtoul_nonzero() with sysfs_strtoul_clamp() and
      limits the value range to [1, UINT_MAX]. Then the unsigned integer
      overflow and div-zero error can be avoided.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • C
      bcache: improve sysfs_strtoul_clamp() · 98eddc19
      Coly Li committed
      [ Upstream commit 596b5a5dd1bc2fa019fdaaae522ef331deef927f ]
      
      Currently sysfs_strtoul_clamp() is defined as,
      #define sysfs_strtoul_clamp(file, var, min, max)                   \
      do {                                                               \
              if (attr == &sysfs_ ## file)                               \
                      return strtoul_safe_clamp(buf, var, min, max)      \
                              ?: (ssize_t) size;                         \
      } while (0)
      
      The problem is, if the bit width of var is less than unsigned long, min
      and max may not protect var from integer overflow, because the overflow
      happens in strtoul_safe_clamp() before min and max are checked.
      
      To fix such overflow in sysfs_strtoul_clamp() and make min and max take
      effect, this patch adds an unsigned long variable and passes it to the
      macro strtoul_safe_clamp(), which converts the input to an unsigned long
      value in the range defined by [min, max]. This value is then assigned to
      var. With this method, if the bit width of var is less than unsigned
      long, integer overflow won't happen before min and max are checked.
      
      Now sysfs_strtoul_clamp() can properly handle smaller data types like
      unsigned int; of course min and max should be defined in the range of
      unsigned int too.
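The ordering issue described above (narrowing before clamping versus clamping before narrowing) can be illustrated in userspace C. This is a sketch under assumptions, not the kernel macro: `toy_parse_unchecked()` and `toy_parse_clamped()` are hypothetical helpers, and fixed-width types stand in for unsigned long and unsigned int so the behavior does not depend on the platform.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bug pattern: parse into a wide type, then narrow with no range check. */
uint32_t toy_parse_unchecked(const char *buf)
{
    uint64_t wide = strtoull(buf, NULL, 10);

    return (uint32_t)wide;      /* keeps only the low 32 bits */
}

/* Fix pattern: clamp while the value is still wide, then narrow. */
uint32_t toy_parse_clamped(const char *buf, uint32_t min, uint32_t max)
{
    uint64_t wide = strtoull(buf, NULL, 10);

    assert(min <= max);
    if (wide < min)
        wide = min;
    if (wide > max)
        wide = max;
    return (uint32_t)wide;      /* safe: wide is within [min, max] */
}
```

With the input "4294967296", the unchecked helper yields 0 because only the low 32 bits survive the narrowing, while the clamped helper with the range [1, UINT32_MAX] yields UINT32_MAX.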
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • C
      bcache: fix potential div-zero error of writeback_rate_i_term_inverse · b468e000
      Coly Li committed
      [ Upstream commit c3b75a2199cdbfc1c335155fe143d842604b1baa ]
      
      dc->writeback_rate_i_term_inverse can be set via the sysfs interface. It
      is of type unsigned int and is converted from the input string by
      d_strtoul(). The problem is that d_strtoul() does not check the valid
      range of the input: if 4294967296 is written into the sysfs file
      writeback_rate_i_term_inverse, an unsigned integer overflow happens and
      dc->writeback_rate_i_term_inverse is set to 0.
      
      In writeback.c:__update_writeback_rate(), there are following lines of
      code,
            integral_scaled = div_s64(dc->writeback_rate_integral,
                            dc->writeback_rate_i_term_inverse);
      If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
      a div-zero error might be triggered in the above code.
      
      Therefore we need to add a range limit in the sysfs interface. That is
      what this patch does: it uses sysfs_strtoul_clamp() to replace
      d_strtoul() and restricts the input range to [1, UINT_MAX].
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • C
      bcache: fix input overflow to sequential_cutoff · c7b687eb
      Coly Li committed
      [ Upstream commit 8c27a3953e92eb0b22dbb03d599f543a05f9574e ]
      
      People may set sequential_cutoff of a cached device via a sysfs file,
      but the current code does not check the input value for overflow. E.g.
      if 4294967295 (UINT_MAX) is written to the file sequential_cutoff, its
      value is 4GB, but if 4294967296 (UINT_MAX + 1) is written, its value
      will be 0. This is unexpected behavior.
      
      This patch replaces d_strtoi_h() with sysfs_strtoul_clamp() to convert
      the input string to an unsigned integer value and limit its range to
      [0, UINT_MAX]. Then the input overflow is fixed.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • C
      bcache: fix input overflow to cache set sysfs file io_error_halflife · 16975f04
      Coly Li committed
      [ Upstream commit a91fbda49f746119828f7e8ad0f0aa2ab0578f65 ]
      
      The cache set sysfs entry io_error_halflife is used to set
      c->error_decay. c->error_decay is of type unsigned int and is converted
      by strtoul_or_return(), therefore overflow of c->error_decay is possible
      for a large input value.
      
      This patch fixes the overflow by using strtoul_safe_clamp() to convert
      the input string to an unsigned long value in the range [0, UINT_MAX],
      then divides it by 88 and assigns the result to c->error_decay.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 24 March, 2019 2 commits
    • C
      bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata · e578f90d
      Coly Li committed
      commit dc7292a5bcb4c878b076fca2ac3fc22f81b8f8df upstream.
      
      In 'commit 752f66a75aba ("bcache: use REQ_PRIO to indicate bio for
      metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
      This assumption is not always correct, e.g. XFS uses REQ_META rather
      than REQ_PRIO to mark metadata bio. This is why Nix noticed that bcache
      does not cache metadata for XFS after the above commit.
      
      Thanks to Dave Chinner, who explained the difference between REQ_META
      and REQ_PRIO from the view of a file system developer. Here I quote part
      of his explanation from the mailing list,
         REQ_META is used for metadata. REQ_PRIO is used to communicate to
         the lower layers that the submitter considers this IO to be more
         important that non REQ_PRIO IO and so dispatch should be expedited.
      
         IOWs, if the filesystem considers metadata IO to be more important
         that user data IO, then it will use REQ_PRIO | REQ_META rather than
         just REQ_META.
      
      Then it seems bios with REQ_META or REQ_PRIO should both be cached for
      performance optimization, because the upper layer (e.g. file system)
      probably demands low I/O latency for all of them.
      
      So in this patch, when we want to decide whether to bypass the cache,
      REQ_META and REQ_PRIO are both checked. Then both metadata and
      high priority I/O requests will be handled properly.
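The difference between testing one flag and testing either flag can be sketched as follows. The flag values and helper names (`TOY_REQ_META`, `TOY_REQ_PRIO`, `toy_checks_prio_only()`, `toy_is_meta_or_prio()`) are hypothetical; the real REQ_META and REQ_PRIO bits are defined by the kernel's block layer and the actual bypass decision involves more conditions than shown here.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define TOY_REQ_META (1u << 0)
#define TOY_REQ_PRIO (1u << 1)

/* Earlier behavior as described: only the priority flag marks a bio to keep. */
bool toy_checks_prio_only(unsigned int opf)
{
    return (opf & TOY_REQ_PRIO) != 0;
}

/* Behavior after the patch as described: either flag is enough. */
bool toy_is_meta_or_prio(unsigned int opf)
{
    return (opf & (TOY_REQ_META | TOY_REQ_PRIO)) != 0;
}
```

A bio carrying only the metadata flag, as XFS reportedly submits, fails the first test and passes the second.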
      Reported-by: Nix <nix@esperi.org.uk>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Andre Noll <maan@tuebingen.mpg.de>
      Tested-by: Nix <nix@esperi.org.uk>
      Cc: stable@vger.kernel.org
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • D
      bcache: never writeback a discard operation · 622afe5c
      Daniel Axtens committed
      commit 9951379b0ca88c95876ad9778b9099e19a95d566 upstream.
      
      Some users see panics like the following when performing fstrim on a
      bcached volume:
      
      [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  530.183928] #PF error: [normal kernel read fault]
      [  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
      [  530.750887] Oops: 0000 [#1] SMP PTI
      [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
      [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
      [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
      [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
      8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
      [  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
      [  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
      [  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
      [  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
      [  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
      [  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
      [  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
      [  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
      [  535.851759] Call Trace:
      [  535.970308]  ? mempool_alloc_slab+0x15/0x20
      [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
      [  536.403399]  blk_mq_make_request+0x97/0x4f0
      [  536.607036]  generic_make_request+0x1e2/0x410
      [  536.819164]  submit_bio+0x73/0x150
      [  536.980168]  ? submit_bio+0x73/0x150
      [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
      [  537.391595]  ? _cond_resched+0x1a/0x50
      [  537.573774]  submit_bio_wait+0x59/0x90
      [  537.756105]  blkdev_issue_discard+0x80/0xd0
      [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
      [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
      [  538.324087]  ext4_ioctl+0xea4/0x1530
      [  538.497712]  ? _copy_to_user+0x2a/0x40
      [  538.679632]  do_vfs_ioctl+0xa6/0x600
      [  538.853127]  ? __do_sys_newfstat+0x44/0x70
      [  539.051951]  ksys_ioctl+0x6d/0x80
      [  539.212785]  __x64_sys_ioctl+0x1a/0x20
      [  539.394918]  do_syscall_64+0x5a/0x110
      [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      We have observed it where both:
      1) LVM/devmapper is involved (bcache backing device is LVM volume) and
      2) writeback cache is involved (bcache cache_mode is writeback)
      
      On one machine, we can reliably reproduce it with:
      
       # echo writeback > /sys/block/bcache0/bcache/cache_mode
         (not sure whether above line is required)
       # mount /dev/bcache0 /test
       # for i in {0..10}; do
      	file="$(mktemp /test/zero.XXX)"
      	dd if=/dev/zero of="$file" bs=1M count=256
      	sync
      	rm $file
          done
        # fstrim -v /test
      
      Observing this with tracepoints on, we see the following writes:
      
      fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
      <crash>
      
      Note the final one has different hit/bypass flags.
      
      This is because in should_writeback(), we were hitting a case where
      the partial stripe condition was returning true and so
      should_writeback() was returning true early.
      
      If that hadn't been the case, it would have hit the would_skip test, and
      as would_skip == s->iop.bypass == true, should_writeback() would have
      returned false.
      
      Looking at the git history from 'commit 72c27061 ("bcache: Write out
      full stripes")', it looks like the idea was to optimise for raid5/6:
      
             * If a stripe is already dirty, force writes to that stripe to
      	 writeback mode - to help build up full stripes of dirty data
      
      To fix this issue, make sure that should_writeback() on a discard op
      never returns true.
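The ordering problem can be modeled with a deliberately reduced decision function. This is a sketch, assuming only the three inputs discussed above; `toy_should_writeback()` is a hypothetical name, and the real should_writeback() also considers conditions that are omitted here.

```c
#include <stdbool.h>

/*
 * Reduced model of the decision order after the fix as described:
 * a discard is rejected before the dirty-stripe shortcut can fire.
 */
bool toy_should_writeback(bool is_discard, bool stripe_dirty, bool would_skip)
{
    if (is_discard)
        return false;   /* never write back a discard */

    if (stripe_dirty)
        return true;    /* early return that previously caught discards too */

    return !would_skip;
}
```

Without the first check, a discard that lands on an already dirty stripe takes the early return and is treated as a writeback candidate even though it was marked to bypass the cache.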
      
      More details of debugging:
      https://www.spinics.net/lists/linux-bcache/msg06996.html
      
      Previous reports:
       - https://bugzilla.kernel.org/show_bug.cgi?id=201051
       - https://bugzilla.kernel.org/show_bug.cgi?id=196103
       - https://www.spinics.net/lists/linux-bcache/msg06885.html
      
      (Coly Li: minor modification to follow maximum 75 chars per line rule)
      
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 72c27061 ("bcache: Write out full stripes")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 14 November, 2018 5 commits
  6. 27 September, 2018 1 commit
    • G
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang committed
      After a write to the SSD completes, bcache schedules journal_write work
      on system_wq, which is a public workqueue in the system without the
      WQ_MEM_RECLAIM flag. system_wq is also a bound wq, and there may be no
      idle kworker on the current processor. Creating a new kworker may
      unfortunately need to reclaim memory first, by shrinking cache and slab
      used by vfs, which depends on the bcache device. That's a deadlock.
      
      This patch creates a new workqueue for journal_write with the
      WQ_MEM_RECLAIM flag. Its rescuer thread will work to avoid the deadlock.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 23 August, 2018 2 commits
  8. 12 August, 2018 17 commits
  9. 11 August, 2018 1 commit
    • C
      bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li committed
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set new writeback rate, after the input is converted from memory
      buf to long int by sysfs_strtoul_clamp().
      
      The above change has a problem because there is an implicit return
      inside sysfs_strtoul_clamp() so the following atomic_long_set()
      won't be called. This error is detected by 0day system with following
      snipped smatch warnings:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int type result.
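The hidden-return problem can be reproduced in a few lines of userspace C. This is a sketch under assumptions: `TOY_PARSE_AND_RETURN`, `toy_store_buggy()`, `toy_store_fixed()` and `toy_rate` are hypothetical, and the toy macro returns unconditionally, whereas the real macro only returns when the attribute matches.

```c
#include <stdlib.h>

long toy_rate = -1;     /* stands in for the stored writeback rate */

/* Toy macro that, like the macro described above, returns from the caller. */
#define TOY_PARSE_AND_RETURN(buf, var)          \
do {                                            \
    (var) = strtol((buf), NULL, 10);            \
    return 0;                                   \
} while (0)

/* Buggy shape: the assignment after the macro is never reached. */
int toy_store_buggy(const char *buf)
{
    long v;

    TOY_PARSE_AND_RETURN(buf, v);
    toy_rate = v;       /* dead code: the macro already returned */
    return 0;
}

/* Fixed shape: convert and clamp with plain code, then store. */
int toy_store_fixed(const char *buf)
{
    long v = strtol(buf, NULL, 10);

    if (v < 1)
        v = 1;          /* lower bound, mirroring the 1..INT_MAX clamp */
    toy_rate = v;
    return 0;
}
```

Calling `toy_store_buggy("100")` leaves `toy_rate` at its old value, while `toy_store_fixed("100")` updates it, which is why control flow hidden inside a macro is easy to miss in review.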
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 09 August, 2018 3 commits
    • S
      bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang committed
      Remove the tailing backslash in macro BTREE_FLAG in btree.h
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • S
      bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang committed
      The pr_err statement in the code for the sysfs_attatch section would run
      for various error codes, which may be confusing.
      
      E.g,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      In dmesg, after the command run twice, we can get:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first statement in the message was right, but the second was
      confusing.
      
      bch_cached_dev_attach has various pr_ statements for various error
      codes, except ENOENT.
      
      After the change, rerun above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for a nonexistent SET-UUID:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • C
      bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li committed
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce a performance regression because multiple faster
      writeback threads of the idle bcache devices will compete for the
      btree-level locks with the bcache devices that have I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writeback when
      all bcache devices attached to the cache set are idle. And if one of the
      bcache devices has a new I/O request coming, it minimizes all writeback
      throughput immediately and lets the PI controller
      __update_writeback_rate() decide the upcoming writeback rate for each
      bcache device.
      
      Also, when all bcache devices are idle, limiting the writeback rate to a
      small number is a waste of throughput, especially when backing devices
      are slower non-rotation devices (e.g. SATA SSD). This patch sets a max
      writeback rate for each backing device if the whole cache set is idle. A
      faster writeback rate in idle time means new I/Os may have more
      available space for dirty data, and people may observe better write
      performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: Coly Li <colyli@suse.de>
      Tested-by: Kai Krakow <kai@kaishome.de>
      Tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>