1. 14 February 2017, 3 commits
    • md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Authored by Song Liu
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB pages,
      so this chunk contains 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, we count how many stripes of this big_stripe
      are in the write back cache. These counters are tracked in a radix
      tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
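      As a rough illustration (not the driver's actual code), the radix
      tree key can be derived by dividing a stripe's start sector by the
      chunk size in sectors, so every stripe inside one chunk maps to the
      same key. A minimal user-space C sketch of that arithmetic, with
      sizes matching the 256kB-chunk example above:

      #include <stdio.h>
      #include <stdint.h>

      /* Hypothetical stand-in for the big_stripe key calculation: all
       * stripes inside one chunk share one key, so a single lookup covers
       * the whole chunk. */
      static uint64_t big_stripe_key(uint64_t sector, uint64_t chunk_sectors)
      {
              return sector / chunk_sectors;   /* chunk number == key */
      }

      int main(void)
      {
              uint64_t chunk_sectors = 512;    /* 256kB = 512 sectors of 512B */
              uint64_t stripe_sectors = 8;     /* 4kB stripe */

              /* the 64 stripes of chunk 0 all map to key 0 */
              for (uint64_t s = 0; s < chunk_sectors; s += stripe_sectors)
                      if (big_stripe_key(s, chunk_sectors) != 0)
                              return 1;
              printf("key(0)=%llu key(512)=%llu\n",
                     (unsigned long long)big_stripe_key(0, chunk_sectors),
                     (unsigned long long)big_stripe_key(512, chunk_sectors));
              return 0;
      }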
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up the
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This lookup is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags is set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Authored by Shaohua Li
      We made raid5 stripe handling multi-threaded before. It works well for
      SSDs, but for hard disks the multi-threading creates more disk seeks,
      so it does not always improve performance. For a raid5 array built from
      several hard disks, multi-threading is still required, as raid5d becomes
      a bottleneck, especially for sequential writes.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is hard disk based. Other threads can still handle stripes, but
      can't dispatch IO.
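
      As an illustration only (not the md code itself), the dispatch policy
      boils down to a small predicate; the two inputs here are simplified
      stand-ins for what the driver actually checks:

      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified sketch of the dispatch decision: on SSD-backed arrays
       * any handling thread may submit IO; on rotational (hard disk)
       * arrays only the raid5d thread submits, so requests leave the
       * driver in a more seek-friendly order. */
      static bool may_dispatch_io(bool rotational_array, bool is_raid5d)
      {
              if (!rotational_array)
                      return true;     /* SSDs: dispatch from any thread */
              return is_raid5d;        /* HDDs: defer to raid5d */
      }

      int main(void)
      {
              printf("HDD array, worker thread: %d\n", may_dispatch_io(true, false));
              printf("HDD array, raid5d:        %d\n", may_dispatch_io(true, true));
              printf("SSD array, worker thread: %d\n", may_dispatch_io(false, false));
              return 0;
      }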
      
      Ideally, we should control the IO dispatch order according to IO
      position internally. Right now we still depend on the block layer,
      which isn't always very efficient.
      
      My setup has 9 hard disks; each disk can do around 180M/s sequential
      write, so in theory the raid5 array can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measured large-iodepth
      sequential write bandwidth to the raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      case. The gap between small iodepth sequential write performance of
      software raid and the theoretical value is still very big, though,
      because we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md linear: fix a race between linear_add() and linear_congested() · 03a9e24e
      Authored by colyli@suse.de
      Recently I received a bug report that, on a Linux v3.0 based kernel,
      hot adding a disk to an md linear device causes a kernel crash in
      linear_congested(). From the crash image analysis, I found that in
      linear_congested() mddev->raid_disks contains the value N, but
      conf->disks[] only has N-1 pointers available. A NULL pointer
      dereference then crashes the kernel.
      
      There is a race between linear_add() and linear_congested(); the RCU
      protection used in these two functions cannot avoid it. Since Linux
      v4.0 the RCU code has been replaced by mddev_suspend(). After checking
      the upstream code, it seems linear_congested() is not called in the
      generic_make_request() code path, so mddev_suspend() cannot prevent it
      from being called. The possible race still exists.
      
      Here is how the race still exists in the current code. On a machine
      with many CPUs, linear_add() is called on one CPU to add a hard disk
      to an md linear device; at the same time, on another CPU,
      linear_congested() is called to detect whether this md linear device
      is congested before issuing an I/O request to it.
      
      A possible execution time sequence demonstrates how the race can
      happen:
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                              for (i=0; i<mddev->raid_disks;i++)
       4                                bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add(), mddev->raid_disks is increased at time seq 2, and on
      another CPU linear_congested() iterates conf->disks[i] up to the
      increased mddev->raid_disks at time seq 3 and 4. But conf, which needs
      one more element (a pointer to struct dev_info) in conf->disks[], has
      not been updated yet, so accessing its structure members at time seq 4
      causes a NULL pointer dereference fault.
      
      To fix this race, there are two parts to the modification in this
      patch:
       1) Add 'int raid_disks' to struct linear_conf, as a copy of
          mddev->raid_disks. It is initialized in linear_conf() and always
          kept consistent with the number of entries in 'struct dev_info
          disks[]'. When iterating conf->disks[] in linear_congested(), use
          conf->raid_disks instead of mddev->raid_disks in the for-loop, so
          the NULL pointer dereference cannot happen again.
       2) Bring the RCU protection back, and use kfree_rcu() in linear_add()
          to free the oldconf memory. Because oldconf may still be referenced
          as mddev->private in linear_congested(), kfree_rcu() makes sure that
          its memory is not released until no one uses it any more.
      Some code comments are also added in this patch to make the
      modification easier to understand.
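
      A minimal user-space sketch of the idea behind part 1), with
      simplified stand-in types rather than the kernel's own structures:

      #include <stdio.h>
      #include <stdlib.h>

      struct dev_info {
              void *rdev;                /* placeholder for the member disk */
      };

      struct linear_conf_sketch {
              int raid_disks;            /* copy taken when this conf was built */
              struct dev_info disks[];   /* exactly raid_disks entries          */
      };

      /* Readers iterate with conf->raid_disks, never with the (possibly
       * already incremented) array-wide raid_disks value, so the loop can
       * never run past the end of disks[]. */
      static int congested_sketch(const struct linear_conf_sketch *conf)
      {
              int congested = 0;

              for (int i = 0; i < conf->raid_disks; i++)
                      congested |= (conf->disks[i].rdev == NULL);
              return congested;
      }

      int main(void)
      {
              struct linear_conf_sketch *conf =
                      malloc(sizeof(*conf) + 2 * sizeof(struct dev_info));

              conf->raid_disks = 2;
              conf->disks[0].rdev = conf;   /* dummy non-NULL pointers */
              conf->disks[1].rdev = conf;
              printf("congested: %d\n", congested_sketch(conf));
              free(conf);
              return 0;
      }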
      
      This patch can be applied to kernels since v4.0, after commit
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). But this bug was reported on a Linux v3.0 based
      kernel, so people who maintain kernels before Linux v4.0 need to do
      some back porting of this patch.
      
      Changelog:
       - v3: add 'int raid_disks' to struct linear_conf, and use kfree_rcu()
             to replace call_rcu() in linear_add().
       - v2: add RCU protection based on suggestions from Shaohua and Neil.
       - v1: initial effort.
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 03 February 2017, 3 commits
    • dm crypt: replace RCU read-side section with rwsem · f5b0cba8
      Authored by Ondrej Kozina
      The lockdep splat below hints at a bug in RCU usage in dm-crypt that
      was introduced with commit c538f6ec ("dm crypt: add ability to use
      keys from the kernel key retention service").  The kernel keyring
      function user_key_payload() is in fact a wrapper for
      rcu_dereference_protected(), which must not be called within a plain
      rcu_read_lock() section.
      
      Unfortunately the kernel keyring subsystem doesn't currently provide
      an interface that allows the use of an RCU read-side section.  So for
      now we must drop RCU in favour of rwsem until a proper function is
      made available in the kernel keyring subsystem.
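
      For illustration only (this is generic user-space C, not dm-crypt or
      keyring code), the pattern the patch moves to looks like this: readers
      hold a reader-writer lock for the whole payload access instead of an
      RCU read-side section:

      #include <pthread.h>
      #include <stdio.h>

      static pthread_rwlock_t key_lock = PTHREAD_RWLOCK_INITIALIZER;
      static const char *key_payload = "secret";

      static void read_payload(void)
      {
              pthread_rwlock_rdlock(&key_lock);       /* shared, read side   */
              printf("payload: %s\n", key_payload);   /* valid while held    */
              pthread_rwlock_unlock(&key_lock);
      }

      static void update_payload(const char *p)
      {
              pthread_rwlock_wrlock(&key_lock);       /* exclusive, write side */
              key_payload = p;
              pthread_rwlock_unlock(&key_lock);
      }

      int main(void)
      {
              read_payload();
              update_payload("rotated");
              read_payload();
              return 0;
      }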
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.10.0-rc5 #2 Not tainted
      -------------------------------
      ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
      other info that might help us debug this:
      rcu_scheduler_active = 2, debug_locks = 1
      2 locks held by cryptsetup/6464:
       #0:  (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod]
       #1:  (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt]
      stack backtrace:
      CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
      Call Trace:
       dump_stack+0x67/0x92
       lockdep_rcu_suspicious+0xc5/0x100
       crypt_set_key+0x351/0x4b0 [dm_crypt]
       ? crypt_set_key+0x1d8/0x4b0 [dm_crypt]
       crypt_ctr+0x341/0xa53 [dm_crypt]
       dm_table_add_target+0x147/0x330 [dm_mod]
       table_load+0x111/0x350 [dm_mod]
       ? retrieve_status+0x1c0/0x1c0 [dm_mod]
       ctl_ioctl+0x1f5/0x510 [dm_mod]
       dm_ctl_ioctl+0xe/0x20 [dm_mod]
       do_vfs_ioctl+0x8e/0x690
       ? ____fput+0x9/0x10
       ? task_work_run+0x7e/0xa0
       ? trace_hardirqs_on_caller+0x122/0x1b0
       SyS_ioctl+0x3c/0x70
       entry_SYSCALL_64_fastpath+0x18/0xad
      RIP: 0033:0x7f392c9a4ec7
      RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7
      RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005
      RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0
      R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08
      R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b
      
      Fixes: c538f6ec ("dm crypt: add ability to use keys from the kernel key retention service")
      Reported-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Ondrej Kozina <okozina@redhat.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm rq: cope with DM device destruction while in dm_old_request_fn() · 4087a1ff
      Authored by Mike Snitzer
      Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
      being passed from dm_old_request_fn() that races with DM device
      destruction.
      
      Reported-by: artem@flashgrid.io
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  3. 25 January 2017, 6 commits
    • md/r5cache: disable write back for degraded array · 2e38a37f
      Authored by Song Liu
      Write-back cache in degraded mode introduces corner cases to the array.
      Although we try to cover all these corner cases, it is safer to just
      disable write-back cache when the array is in degraded mode.
      
      In this patch, we disable the writeback cache for degraded mode (a
      rough sketch of the policy follows the list):
      1. On device failure, if the array enters degraded mode, raid5_error()
         will submit the async job r5c_disable_writeback_async to disable
         writeback;
      2. In r5c_journal_mode_store(), it is invalid to enable writeback in
         degraded mode;
      3. In r5c_try_caching_write(), stripes with s->failed > 0 will be
         handled in write-through mode.
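
      The sketch below is a plain user-space C illustration of that policy,
      not the driver's code; the names and types are made up for clarity:

      #include <stdbool.h>
      #include <stdio.h>

      /* Write-back caching is only allowed while no member device has
       * failed; journal-mode changes and per-stripe caching decisions both
       * enforce this. */
      static bool writeback_allowed(int failed_devices)
      {
              return failed_devices == 0;
      }

      int main(void)
      {
              printf("optimal array:  writeback %s\n",
                     writeback_allowed(0) ? "allowed" : "refused");
              printf("degraded array: writeback %s\n",
                     writeback_allowed(1) ? "allowed" : "refused");
              return 0;
      }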
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: shift complex rmw from read path to write path · 07e83364
      Authored by Song Liu
      Write back cache requires a complex RMW mechanism, where old data is
      read into dev->orig_page for prexor, and then xor is done with
      dev->page. This logic is already implemented in the write path.
      
      However, the current read path is not aware of this requirement. When
      the array is optimal, the RMW is not required, as the data are read
      from the raid disks. However, when the target stripe is degraded,
      complex RMW is required to generate the right data.
      
      To keep the read path as clean as possible, we handle it by flushing
      degraded, in-journal stripes before processing reads to a missing
      dev.
      
      Specifically, when there are read requests to a degraded stripe
      with data in the journal, handle_stripe_fill() calls
      r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
      will do the complex RMW and flush the stripe to the RAID disks. After
      that, the read requests are handled.
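
      A rough user-space sketch of that decision (the helper names here are
      made up; this is not the md code):

      #include <stdbool.h>
      #include <stdio.h>

      enum read_action { FILL_READS, WRITE_OUT_FIRST };

      /* A degraded stripe that still has dirty data in the journal must be
       * flushed to the raid disks (via the write path's complex RMW) before
       * a read to a missing dev can be served. */
      static enum read_action handle_read(bool stripe_degraded, bool data_in_journal)
      {
              if (stripe_degraded && data_in_journal)
                      return WRITE_OUT_FIRST;
              return FILL_READS;
      }

      int main(void)
      {
              printf("%d\n", handle_read(true, true));    /* 1: write out first */
              printf("%d\n", handle_read(false, true));   /* 0: fill reads      */
              return 0;
      }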
      
      There is one more corner case: when there is a non-overwrite bio for
      the missing (or out of sync) dev, handle_stripe_dirtying() will not
      be able to process the non-overwrite bios without constructing the
      data in handle_stripe_fill(). This is fixed by delaying non-overwrite
      bios in handle_stripe_dirtying(), so handle_stripe_fill() works on
      these bios after the stripe is flushed to the raid disks.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: flush data only stripes in r5l_recovery_log() · a85dd7b8
      Authored by Song Liu
      For safer operation, all arrays start in write-through mode, which has
      been better tested and is more mature. The write-through/write-back
      setting is not persistent across array restarts anyway, so we always
      start an array in write-through mode. However, if recovery finds
      data-only stripes left over from a previous write-back session before
      the shutdown, it is not safe to start the array in write-through mode,
      as write-through mode cannot handle stripes with data in the write-back
      cache. To solve this problem, we flush all data-only stripes in
      r5l_recovery_log(). When r5l_recovery_log() returns, the array starts
      with an empty cache in write-through mode.
      
      This logic is implemented in r5c_recovery_flush_data_only_stripes():
      
      1. enable write back cache
      2. flush all stripes
      3. wake up conf->mddev->thread
      4. wait for all stripes to get flushed (reusing wait_for_quiescent)
      5. disable write back cache
      
      The wait in step 4 will be woken up in release_inactive_stripe_list()
      when conf->active_stripes reaches 0.
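
      A generic user-space analogue of steps 3-5 (this is not md code; the
      counter and names are invented for illustration): the waiter sleeps
      until the in-flight count drops to zero, and whoever releases the last
      item wakes it up.

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  quiescent = PTHREAD_COND_INITIALIZER;
      static int active_stripes = 64;

      static void *flush_worker(void *arg)
      {
              (void)arg;
              for (;;) {
                      pthread_mutex_lock(&lock);
                      if (active_stripes == 0) {
                              pthread_mutex_unlock(&lock);
                              break;
                      }
                      if (--active_stripes == 0)      /* released the last one */
                              pthread_cond_signal(&quiescent);
                      pthread_mutex_unlock(&lock);
              }
              return NULL;
      }

      int main(void)
      {
              pthread_t t;

              pthread_create(&t, NULL, flush_worker, NULL);
              pthread_mutex_lock(&lock);
              while (active_stripes > 0)              /* wait for quiescence */
                      pthread_cond_wait(&quiescent, &lock);
              pthread_mutex_unlock(&lock);
              pthread_join(t, NULL);
              printf("all stripes flushed\n");
              return 0;
      }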
      
      It is safe to wake up mddev->thread here because all the resources
      required by the thread have been initialized.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5: move comment of fetch_block to right location · ba02684d
      Authored by Song Liu
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/r5cache: read data into orig_page for prexor of cached data · 86aa1397
      Authored by Song Liu
      With write back cache, we use orig_page to do prexor. This patch
      makes sure we read data into orig_page for it.
      
      Flag R5_OrigPageUPTDODATE is added to show whether orig_page
      has the latest data from raid disk.
      
      We introduce a helper function uptodate_for_rmw() to simplify a
      couple of conditions in handle_stripe_dirtying().
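
      A hedged user-space sketch of the idea behind that helper (the flag
      names mirror the description above, but the struct and code here are
      illustrative, not the driver's):

      #include <stdbool.h>
      #include <stdio.h>

      struct dev_state {
              bool uptodate;             /* R5_UPTODATE            */
              bool in_journal;           /* data cached in journal */
              bool orig_page_uptodate;   /* R5_OrigPageUPTDODATE   */
      };

      /* For read-modify-write, dev->page being up to date is only enough
       * when the block is not in the journal; if it is, orig_page must also
       * hold the latest data read from the raid disk. */
      static bool uptodate_for_rmw(const struct dev_state *dev)
      {
              return dev->uptodate &&
                     (!dev->in_journal || dev->orig_page_uptodate);
      }

      int main(void)
      {
              struct dev_state d = { true, true, false };

              printf("rmw ready: %d\n", uptodate_for_rmw(&d));  /* 0 */
              d.orig_page_uptodate = true;
              printf("rmw ready: %d\n", uptodate_for_rmw(&d));  /* 1 */
              return 0;
      }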
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid5-cache: delete meaningless code · d46d29f0
      Authored by Shaohua Li
      sector_t is unsigned long; it's never < 0.
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 10 January 2017, 1 commit
  5. 06 January 2017, 5 commits
  6. 04 January 2017, 2 commits
  7. 25 December 2016, 1 commit
  8. 18 December 2016, 2 commits
  9. 16 December 2016, 1 commit
  10. 14 December 2016, 1 commit
    • dm flakey: introduce "error_writes" feature · ef548c55
      Authored by Mike Snitzer
      Recent dm-flakey fixes, to have reads error out during the "down"
      interval, made it so that the previous read behaviour is no longer
      available.
      
      It is useful to have reads complete like normal but have writes error
      out, so make it possible again with a new "error_writes" feature.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  11. 09 December 2016, 15 commits