1. 25 4月, 2019 11 次提交
    • C
      bcache: return error immediately in bch_journal_replay() · 68d10e69
      Coly Li 提交于
      When failure happens inside bch_journal_replay(), calling
      cache_set_err_on() and handling the failure in async way is not a good
      idea. Because after bch_journal_replay() returns, registering code will
      continue to execute following steps, and unregistering code triggered
      by cache_set_err_on() is running in same time. First it is unnecessary
      to handle failure and unregister cache set in an async way, second there
      might be potential race condition to run register and unregister code
      for same cache set.
      
      So in this patch, if failure happens in bch_journal_replay(), we don't
      call cache_set_err_on(), and just print out the same error message to
      kernel message buffer, then return -EIO immediately caller. Then caller
      can detect such failure and handle it in synchrnozied way.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      68d10e69
    • C
      bcache: add comments for kobj release callback routine · 2d17456e
      Coly Li 提交于
      Bcache has several routines to release resources in implicit way, they
      are called when the associated kobj released. This patch adds code
      comments to notice when and which release callback will be called,
      - When dc->disk.kobj released:
        void bch_cached_dev_release(struct kobject *kobj)
      - When d->kobj released:
        void bch_flash_dev_release(struct kobject *kobj)
      - When c->kobj released:
        void bch_cache_set_release(struct kobject *kobj)
      - When ca->kobj released
        void bch_cache_release(struct kobject *kobj)
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2d17456e
    • C
      bcache: add failure check to run_cache_set() for journal replay · ce3e4cfb
      Coly Li 提交于
      Currently run_cache_set() has no return value, if there is failure in
      bch_journal_replay(), the caller of run_cache_set() has no idea about
      such failure and just continue to execute following code after
      run_cache_set().  The internal failure is triggered inside
      bch_journal_replay() and being handled in async way. This behavior is
      inefficient, while failure handling inside bch_journal_replay(), cache
      register code is still running to start the cache set. Registering and
      unregistering code running as same time may introduce some rare race
      condition, and make the code to be more hard to be understood.
      
      This patch adds return value to run_cache_set(), and returns -EIO if
      bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
      such failure and stop registering code flow immedidately inside
      register_cache_set().
      
      If journal replay fails, run_cache_set() can report error immediately
      to register_cache_set(). This patch makes the failure handling for
      bch_journal_replay() be in synchronized way, easier to understand and
      debug, and avoid poetential race condition for register-and-unregister
      in same time.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ce3e4cfb
    • C
      bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() · 1bee2add
      Coly Li 提交于
      In journal_reclaim() ja->cur_idx of each cache will be update to
      reclaim available journal buckets. Variable 'int n' is used to count how
      many cache is successfully reclaimed, then n is set to c->journal.key
      by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
      loop will write the jset data onto each cache.
      
      The problem is, if all jouranl buckets on each cache is full, the
      following code in journal_reclaim(),
      
      529 for_each_cache(ca, c, iter) {
      530       struct journal_device *ja = &ca->journal;
      531       unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
      532
      533       /* No space available on this device */
      534       if (next == ja->discard_idx)
      535               continue;
      536
      537       ja->cur_idx = next;
      538       k->ptr[n++] = MAKE_PTR(0,
      539                         bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
      540                         ca->sb.nr_this_dev);
      541 }
      542
      543 bkey_init(k);
      544 SET_KEY_PTRS(k, n);
      
      If there is no available bucket to reclaim, the if() condition at line
      534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
      will set KEY_PTRS field of c->journal.key to 0.
      
      Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
      journal_write_unlocked() the journal data is written in following loop,
      
      649	for (i = 0; i < KEY_PTRS(k); i++) {
      650-671		submit journal data to cache device
      672	}
      
      If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
      won't be written to cache device here. If system crahed or rebooted
      before bkeys of the lost journal entries written into btree nodes, data
      corruption will be reported during bcache reload after rebooting the
      system.
      
      Indeed there is only one cache in a cache set, there is no need to set
      KEY_PTRS field in journal_reclaim() at all. But in order to keep the
      for_each_cache() logic consistent for now, this patch fixes the above
      problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
      available to reclaim.
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1bee2add
    • C
      bcache: move definition of 'int ret' out of macro read_bucket() · 14215ee0
      Coly Li 提交于
      'int ret' is defined as a local variable inside macro read_bucket().
      Since this macro is called multiple times, and following patches will
      use a 'int ret' variable in bch_journal_read(), this patch moves
      definition of 'int ret' from macro read_bucket() to range of function
      bch_journal_read().
      Signed-off-by: NColy Li <colyli@suse.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      14215ee0
    • L
      bcache: fix a race between cache register and cacheset unregister · a4b732a2
      Liang Chen 提交于
      There is a race between cache device register and cache set unregister.
      For an already registered cache device, register_bcache will call
      bch_is_open to iterate through all cachesets and check every cache
      there. The race occurs if cache_set_free executes at the same time and
      clears the caches right before ca is dereferenced in bch_is_open_cache.
      To close the race, let's make sure the clean up work is protected by
      the bch_register_lock as well.
      
      This issue can be reproduced as follows,
      while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
      while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &
      
      and results in the following oops,
      
      [  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
      [  +0.000457] #PF error: [normal kernel read fault]
      [  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
      [  +0.000388] Oops: 0000 [#1] SMP PTI
      [  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
      [  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
      [  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
      [  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
      [  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
      [  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
      [  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
      [  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
      [  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
      [  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
      [  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
      [  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
      [  +0.000837] Call Trace:
      [  +0.000682]  ? _cond_resched+0x10/0x20
      [  +0.000691]  ? __kmalloc+0x131/0x1b0
      [  +0.000710]  kernfs_fop_write+0xfa/0x170
      [  +0.000733]  __vfs_write+0x2e/0x190
      [  +0.000688]  ? inode_security+0x10/0x30
      [  +0.000698]  ? selinux_file_permission+0xd2/0x120
      [  +0.000752]  ? security_file_permission+0x2b/0x100
      [  +0.000753]  vfs_write+0xa8/0x1a0
      [  +0.000676]  ksys_write+0x4d/0xb0
      [  +0.000699]  do_syscall_64+0x3a/0xf0
      [  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a4b732a2
    • G
      bcache: Clean up bch_get_congested() · 3a394727
      George Spelvin 提交于
      There are a few nits in this function.  They could in theory all
      be separate patches, but that's probably taking small commits
      too far.
      
      1) I added a brief comment saying what it does.
      
      2) I like to declare pointer parameters "const" where possible
         for documentation reasons.
      
      3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
      weight of a 32-bit random number (giving a random integer with
      mean 16 and variance 8).  Passing by reference in a 64-bit variable
      is silly; just use hweight32().
      
      4) Its helper function fract_exp_two is unnecessarily tangled.
      Gcc can optimize the multiply by (1 << x) to a shift, but it can
      be written in a much more straightforward way at the cost of one
      more bit of internal precision.  Some analysis reveals that this
      bit is always available.
      
      This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
      
      0000000000000000 <foo1>:
         0:   89 f9                   mov    %edi,%ecx
         2:   c1 e9 06                shr    $0x6,%ecx
         5:   b8 01 00 00 00          mov    $0x1,%eax
         a:   d3 e0                   shl    %cl,%eax
         c:   83 e7 3f                and    $0x3f,%edi
         f:   d3 e7                   shl    %cl,%edi
        11:   c1 ef 06                shr    $0x6,%edi
        14:   01 f8                   add    %edi,%eax
        16:   c3                      retq
      
      To 19:
      
      0000000000000017 <foo2>:
        17:   89 f8                   mov    %edi,%eax
        19:   83 e0 3f                and    $0x3f,%eax
        1c:   83 c0 40                add    $0x40,%eax
        1f:   89 f9                   mov    %edi,%ecx
        21:   c1 e9 06                shr    $0x6,%ecx
        24:   d3 e0                   shl    %cl,%eax
        26:   c1 e8 06                shr    $0x6,%eax
        29:   c3                      retq
      
      (Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
      both versions produce the same output.)
      
      5) And finally, the call to bch_get_congested() in check_should_bypass()
      is separated from the use of the value by multiple tests which
      could moot the need to compute it.  Move the computation down to
      where it's needed.  This also saves a local register to hold the
      computed value.
      Signed-off-by: NGeorge Spelvin <lkml@sdf.org>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3a394727
    • G
      bcache: use kmemdup_nul for CACHED_LABEL buffer · 792732d9
      Geliang Tang 提交于
      This patch uses kmemdup_nul to create a NUL-terminated string from
      dc->sb.label. This is better than open coding it.
      
      With this, we can move env[2] initialization into env[] array to make
      code more elegant.
      Signed-off-by: NGeliang Tang <geliangtang@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      792732d9
    • A
      bcache: avoid clang -Wunintialized warning · 78d4eb8a
      Arnd Bergmann 提交于
      clang has identified a code path in which it thinks a
      variable may be unused:
      
      drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
            [-Werror,-Wsometimes-uninitialized]
                              fifo_pop(&ca->free_inc, bucket);
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
       #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                      ^~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
              if (_r) {                                                       \
                  ^~
      drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
                              allocator_wait(ca, bch_allocator_push(ca, bucket));
                                                                        ^~~~~~
      drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
                      if (cond)                                               \
                          ^~~~
      drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
                              fifo_pop(&ca->free_inc, bucket);
                              ^
      drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
       #define fifo_pop(fifo, i)       fifo_pop_front(fifo, (i))
                                      ^
      drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
              if (_r) {                                                       \
              ^
      drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
                              long bucket;
                                         ^
      
      This cannot happen in practice because we only enter the loop
      if there is at least one element in the list.
      
      Slightly rearranging the code makes this clearer to both the
      reader and the compiler, which avoids the warning.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      78d4eb8a
    • G
      bcache: fix inaccurate result of unused buckets · 4e0c04ec
      Guoju Fang 提交于
      To get the amount of unused buckets in sysfs_priority_stats, the code
      count the buckets which GC_SECTORS_USED is zero. It's correct and should
      not be overwritten by the count of buckets which prio is zero.
      Signed-off-by: NGuoju Fang <fangguoju@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4e0c04ec
    • G
      bcache: fix crashes stopping bcache device before read miss done · 1568ee7e
      Guoju Fang 提交于
      The bio from upper layer is considered completed when bio_complete()
      returns. In most scenarios bio_complete() is called in search_free(),
      but when read miss happens, the bio_compete() is called when backing
      device reading completed, while the struct search is still in use until
      cache inserting finished.
      
      If someone stops the bcache device just then, the device may be closed
      and released, but after cache inserting finished the struct search will
      access a freed struct cached_dev.
      
      This patch add the reference of bcache device before bio_complete() when
      read miss happens, and put it after the search is not used.
      Signed-off-by: NGuoju Fang <fangguoju@gmail.com>
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1568ee7e
  2. 17 4月, 2019 3 次提交
  3. 11 4月, 2019 8 次提交
  4. 07 4月, 2019 1 次提交
    • C
      block: remove CONFIG_LBDAF · 72deb455
      Christoph Hellwig 提交于
      Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
      architectures.  These types are required to support block device and/or
      file sizes larger than 2 TiB, and have generally defaulted to on for
      a long time.  Enabling the option only increases the i386 tinyconfig
      size by 145 bytes, and many data structures already always use
      64-bit values for their in-core and on-disk data structures anyway,
      so there should not be a large change in dynamic memory usage either.
      
      Dropping this option removes a somewhat weird non-default config that
      has cause various bugs or compiler warnings when actually used.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      72deb455
  5. 06 4月, 2019 1 次提交
    • M
      dm integrity: fix deadlock with overlapping I/O · 4ed319c6
      Mikulas Patocka 提交于
      dm-integrity will deadlock if overlapping I/O is issued to it, the bug
      was introduced by commit 724376a0 ("dm integrity: implement fair
      range locks").  Users rarely use overlapping I/O so this bug went
      undetected until now.
      
      Fix this bug by correcting, likely cut-n-paste, typos in
      ranges_overlap() and also remove a flawed ranges_overlap() check in
      remove_range_unlocked().  This condition could leave unprocessed bios
      hanging on wait_list forever.
      
      Cc: stable@vger.kernel.org # v4.19+
      Fixes: 724376a0 ("dm integrity: implement fair range locks")
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      4ed319c6
  6. 05 4月, 2019 1 次提交
    • M
      dm: disable DISCARD if the underlying storage no longer supports it · bcb44433
      Mike Snitzer 提交于
      Storage devices which report supporting discard commands like
      WRITE_SAME_16 with unmap, but reject discard commands sent to the
      storage device.  This is a clear storage firmware bug but it doesn't
      change the fact that should a program cause discards to be sent to a
      multipath device layered on this buggy storage, all paths can end up
      failed at the same time from the discards, causing possible I/O loss.
      
      The first discard to a path will fail with Illegal Request, Invalid
      field in cdb, e.g.:
       kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
       kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
       kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
       kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
       kernel: blk_update_request: critical target error, dev sdfn, sector 10487808
      
      The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
      device disables its support for discard on this path, and because of the
      BLK_STS_TARGET error multipath fails the discard without failing any
      path or retrying down a different path.  But subsequent discards can
      cause path failures.  Any discards sent to the path which already failed
      a discard ends up failing with EIO from blk_cloned_rq_check_limits with
      an "over max size limit" error since the discard limit was set to 0 by
      the sd driver for the path.  As the error is EIO, this now fails the
      path and multipath tries to send the discard down the next path.  This
      cycle continues as discards are sent until all paths fail.
      
      Fix this by training DM core to disable DISCARD if the underlying
      storage already did so.
      
      Also, fix branching in dm_done() and clone_endio() to reflect the
      mutually exclussive nature of the IO operations in question.
      
      Cc: stable@vger.kernel.org
      Reported-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      bcb44433
  7. 02 4月, 2019 8 次提交
    • I
      dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors · eb40c0ac
      Ilya Dryomov 提交于
      Some devices don't use blk_integrity but still want stable pages
      because they do their own checksumming.  Examples include rbd and iSCSI
      when data digests are negotiated.  Stacking DM (and thus LVM) on top of
      these devices results in sporadic checksum errors.
      
      Set BDI_CAP_STABLE_WRITES if any underlying device has it set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      eb40c0ac
    • M
      dm: revert 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE") · 75ae1936
      Mikulas Patocka 提交于
      The limit was already incorporated to dm-crypt with commit 4e870e94
      ("dm crypt: fix error with too large bios"), so we don't need to apply
      it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
      wrong anyway because the variable ti->max_io_len it is supposed to be in
      the units of 512-byte sectors not in bytes.
      
      Reduction of the limit to 1048576 sectors could even cause data
      corruption in rare cases - suppose that we have a dm-striped device with
      stripe size 768MiB. The target will call dm_set_target_max_io_len with
      the value 1572864. The buggy code would reduce it to 1048576. Now, the
      dm-core will errorneously split the bios on 1048576-sector boundary
      insetad of 1572864-sector boundary and pass these stripe-crossing bios
      to the striped target.
      
      Cc: stable@vger.kernel.org # v4.16+
      Fixes: 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      75ae1936
    • A
      dm init: fix const confusion for dm_allowed_targets array · 93fc9167
      Andi Kleen 提交于
      A non const pointer to const cannot be marked initconst.
      Mark the array actually const.
      
      Fixes: 6bbc923d dm: add support to directly boot to a mapped device
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      93fc9167
    • Y
      dm integrity: make dm_integrity_init and dm_integrity_exit static · 5efedc9b
      YueHaibing 提交于
      Fix sparse warnings:
      
      drivers/md/dm-integrity.c:3619:12: warning:
       symbol 'dm_integrity_init' was not declared. Should it be static?
      drivers/md/dm-integrity.c:3638:6: warning:
       symbol 'dm_integrity_exit' was not declared. Should it be static?
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      5efedc9b
    • M
      dm integrity: change memcmp to strncmp in dm_integrity_ctr · 0d74e6a3
      Mikulas Patocka 提交于
      If the string opt_string is small, the function memcmp can access bytes
      that are beyond the terminating nul character. In theory, it could cause
      segfault, if opt_string were located just below some unmapped memory.
      
      Change from memcmp to strncmp so that we don't read bytes beyond the end
      of the string.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      0d74e6a3
    • N
      md: batch flush requests. · 2bc13b83
      NeilBrown 提交于
      Currently if many flush requests are submitted to an md device is quick
      succession, they are serialized and can take a long to process them all.
      We don't really need to call flush all those times - a single flush call
      can satisfy all requests submitted before it started.
      So keep track of when the current flush started and when it finished,
      allow any pending flush that was requested before the flush started
      to complete without waiting any more.
      
      Test results from Xiao:
      
      Test is done on a raid10 device which is created by 4 SSDs. The tool is
      dbench.
      
      1. The latest linux stable kernel
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768    10.509    78.305
        Flush                  2078376     0.013    10.094
        Close                  21787697     0.019    18.821
        LockX                    96580     0.007     3.184
        Mkdir                      384     0.008     0.062
        Rename                 1255883     0.191    23.534
        ReadX                  46495589     0.020    14.230
        WriteX                 14790591     7.123    60.706
        Unlink                 5989118     0.440    54.551
        UnlockX                  96580     0.005     2.736
        FIND_FIRST             10393845     0.042    12.079
        SET_FILE_INFORMATION   2415558     0.129    10.088
        QUERY_FILE_INFORMATION 4711725     0.005     8.462
        QUERY_PATH_INFORMATION 26883327     0.032    21.715
        QUERY_FS_INFORMATION   4929409b     0.010     8.238
        NTCreateX              29660080     0.100    53.268
      
      Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.712 ms
      
      2. With patch1 "Revert "MD: fix lock contention for flush bios""
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    256     8.326    36.761
        Flush                   693291     3.974   180.269
        Close                  7266404     0.009    36.929
        LockX                    32160     0.006     0.840
        Mkdir                      128     0.008     0.021
        Rename                  418755     0.063    29.945
        ReadX                  15498708     0.007     7.216
        WriteX                 4932310    22.482   267.928
        Unlink                 1997557     0.109    47.553
        UnlockX                  32160     0.004     1.110
        FIND_FIRST             3465791     0.036     7.320
        SET_FILE_INFORMATION    805825     0.015     1.561
        QUERY_FILE_INFORMATION 1570950     0.005     2.403
        QUERY_PATH_INFORMATION 8965483     0.013    14.277
        QUERY_FS_INFORMATION   1643626     0.009     3.314
        NTCreateX              9892174     0.061    41.278
      
      Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
      max_latency=267.939 m
      
      3. With patch1 and patch2
        Operation                Count    AvgLat    MaxLat
        --------------------------------------------------
        Deltree                    768     9.570    54.588
        Flush                  2061354     0.666    15.102
        Close                  21604811     0.012    25.697
        LockX                    95770     0.007     1.424
        Mkdir                      384     0.008     0.053
        Rename                 1245411     0.096    12.263
        ReadX                  46103198     0.011    12.116
        WriteX                 14667988     7.375    60.069
        Unlink                 5938936     0.173    30.905
        UnlockX                  95770     0.005     4.147
        FIND_FIRST             10306407     0.041    11.715
        SET_FILE_INFORMATION   2395987     0.048     7.640
        QUERY_FILE_INFORMATION 4672371     0.005     9.291
        QUERY_PATH_INFORMATION 26656735     0.018    19.719
        QUERY_FS_INFORMATION   4887940     0.010     7.654
        NTCreateX              29410811     0.059    28.551
      
      Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
      max_latency=60.075 ms
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2bc13b83
    • N
      Revert "MD: fix lock contention for flush bios" · 4bc034d3
      NeilBrown 提交于
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
      1/ it make multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
      Tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4bc034d3
    • N
      Don't jump to compute_result state from check_result state · 4f4fd7c5
      Nigel Croxon 提交于
      Changing state from check_state_check_result to
      check_state_compute_result not only is unsafe but also doesn't
      appear to serve a valid purpose.  A raid6 check should only be
      pushing out extra writes if doing repair and a mis-match occurs.
      The stripe dev management will already try and do repair writes
      for failing sectors.
      
      This patch makes the raid6 check_state_check_result handling
      work more like raid5's.  If somehow too many failures for a
      check, just quit the check operation for the stripe.  When any
      checks pass, don't try and use check_state_compute_result for
      a purpose it isn't needed for and is unsafe for.  Just mark the
      stripe as in sync for passing its parity checks and let the
      stripe dev read/write code and the bad blocks list do their
      job handling I/O errors.
      
      Repro steps from Xiao:
      
      These are the steps to reproduce this problem:
      1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
      2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
      3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
      sde is the disk created by scsi_debug
      4. echo "2" >/sys/module/scsi_debug/parameters/opts
      5. raid-check
      
      It panic:
      [ 4854.730899] md: data-check of RAID array md127
      [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
      [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
      [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
      [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
      [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
      [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
      [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
      [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
      [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
      [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
      [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
      [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
      [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
      [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
      [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
      [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
      [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
      [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
      [ 4854.906201] ------------[ cut here ]------------
      [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!
      
      raid5.c:4190 above is this BUG_ON:
      
          handle_parity_checks6()
              ...
              BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
      
      Cc: <stable@vger.kernel.org> # v3.16+
      OriginalAuthor: David Jeffery <djeffery@redhat.com>
      Cc: Xiao Ni <xni@redhat.com>
      Tested-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NDavid Jeffy <djeffery@redhat.com>
      Signed-off-by: NNigel Croxon <ncroxon@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4f4fd7c5
  8. 13 3月, 2019 4 次提交
  9. 06 3月, 2019 3 次提交
新手
引导
客服 返回
顶部