1. 06 Nov 2019, 2 commits
    • dm raid: change rs_set_dev_and_array_sectors API and callers · 22c992e1
      Heinz Mauelshagen committed
      Add a size argument to rs_set_dev_and_array_sectors as a prerequisite
      to fixing grown-device resynchronization not occurring when new MD
      bitmap pages have to be allocated as a result of the extension in
      a follow-up patch.
      
      Also avoid code duplication by using rs_set_rdev_sectors
      in the aforementioned function.
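      
      As a rough sketch, the API change amounts to the following (the
      parameter name is illustrative, not necessarily the upstream one):
      
      /* before: the array size was always derived internally */
      static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev);
      
      /* after: the caller passes the size, so a grow can supply the new
       * size before the MD bitmap pages are allocated */
      static int rs_set_dev_and_array_sectors(struct raid_set *rs,
                                              sector_t sectors, bool use_mddev);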
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      22c992e1
    • dm table: do not allow request-based DM to stack on partitions · 6ba01df7
      Mike Snitzer committed
      Partitioned request-based devices cannot be used as underlying devices
      for request-based DM because no partition offsets are added to each
      incoming request.  As such, until now, stacking on partitioned devices
      would _always_ result in data corruption (e.g. wiping the partition
      table, writing to other partitions, etc).  Fix this by disallowing
      request-based stacking on partitions.
      
      While at it, since all .request_fn support has been removed from block
      core, remove legacy dm-table code that differentiated between blk-mq and
      .request_fn request-based.
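      
      A sketch of the resulting device check, written against the ~v5.4 block
      API (treat the helper name and exact calls as illustrative):
      
      static int device_is_rq_stackable(struct dm_target *ti, struct dm_dev *dev,
                                        sector_t start, sector_t len, void *data)
      {
              struct block_device *bdev = dev->bdev;
              struct request_queue *q = bdev_get_queue(bdev);
      
              /* request-based cannot stack on partitions! */
              if (bdev != bdev->bd_contains)
                      return false;
      
              return queue_is_mq(q);
      }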
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      6ba01df7
  2. 17 Oct 2019, 2 commits
    • dm cache: fix bugs when a GFP_NOWAIT allocation fails · 13bd677a
      Mikulas Patocka committed
      A GFP_NOWAIT allocation can fail at any time - it doesn't wait for memory
      to become available, and it fails if the mempool is exhausted and there is
      not enough memory.
      
      If we go down this path:
        map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
      we can see that map_bio() doesn't check the return value of mg_start(),
      and the bio is leaked.
      
      If we go down this path:
        map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
        dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
        mg_lock_writes -> mg_complete
      the bio is ended with an error - this is unacceptable because it could
      cause filesystem corruption if the machine temporarily runs out of
      memory.
      
      Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
      wait until memory becomes available. mempool_alloc with GFP_NOIO can't
      fail, so remove the code paths that deal with allocation failure.
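      
      Sketched on the alloc_migration() step of the call chain above (details
      abbreviated; the point is the gfp flag and the dropped NULL check):
      
      static struct dm_cache_migration *alloc_migration(struct cache *cache)
      {
              struct dm_cache_migration *mg;
      
              /* was GFP_NOWAIT, which can return NULL at any time and
               * forced every caller to carry an error path */
              mg = mempool_alloc(&cache->migration_pool, GFP_NOIO);
      
              /* GFP_NOIO sleeps until an element is available, so no
               * NULL check is needed */
              memset(mg, 0, sizeof(*mg));
              mg->cache = cache;
      
              return mg;
      }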
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      13bd677a
    • md/raid0: fix warning message for parameter default_layout · 3874d73e
      Song Liu committed
      The message should match the parameter, i.e. raid0.default_layout.
      
      Fixes: c84a1372 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
      Cc: NeilBrown <neilb@suse.de>
      Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      3874d73e
  3. 10 Oct 2019, 2 commits
    • dm snapshot: rework COW throttling to fix deadlock · b2155578
      Mikulas Patocka committed
      Commit 721b1d98 ("dm snapshot: Fix excessive memory usage and
      workqueue stalls") introduced a semaphore to limit the maximum number of
      in-flight kcopyd (COW) jobs.
      
      The implementation of this throttling mechanism is prone to a deadlock:
      
      1. One or more threads write to the origin device causing COW, which is
         performed by kcopyd.
      
      2. At some point some of these threads might reach the s->cow_count
         semaphore limit and block in down(&s->cow_count), holding a read lock
         on _origins_lock.
      
      3. Someone tries to acquire a write lock on _origins_lock, e.g.,
         snapshot_ctr(), which blocks because the threads at step (2) already
         hold a read lock on it.
      
      4. A COW operation completes and kcopyd runs dm-snapshot's completion
         callback, which ends up calling pending_complete().
         pending_complete() tries to resubmit any deferred origin bios. This
         requires acquiring a read lock on _origins_lock, which blocks.
      
         This happens because the read-write semaphore implementation gives
         priority to writers, meaning that as soon as a writer tries to enter
         the critical section, no readers will be allowed in, until all
         writers have completed their work.
      
         So, pending_complete() waits for the writer at step (3) to acquire
         and release the lock. This writer waits for the readers at step (2)
         to release the read lock and those readers wait for
         pending_complete() (the kcopyd thread) to signal the s->cow_count
         semaphore: DEADLOCK.
      
      The above was thoroughly analyzed and documented by Nikos Tsironis as
      part of his initial proposal for fixing this deadlock, see:
      https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
      
      Fix this deadlock by reworking COW throttling so that it waits without
      holding any locks. Add a variable 'in_progress' that counts how many
      kcopyd jobs are running. A function wait_for_in_progress() will sleep if
      'in_progress' is over the limit. It drops _origins_lock in order to
      avoid the deadlock.
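      
      A simplified sketch of that wait (assuming a wait queue
      's->in_progress_wait' and a limit MAX_IN_FLIGHT_COW; the upstream code
      is more careful about wakeups and the snapshot-merge case):
      
      /* returns false if _origins_lock was dropped and the caller must retry */
      static bool wait_for_in_progress(struct dm_snapshot *s, bool unlock_origins)
      {
              if (likely(s->in_progress < MAX_IN_FLIGHT_COW))
                      return true;
      
              /* over the limit: wait WITHOUT holding any locks */
              if (unlock_origins)
                      up_read(&_origins_lock);
              wait_event(s->in_progress_wait,
                         s->in_progress < MAX_IN_FLIGHT_COW);
              return false;
      }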
      Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
      Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
      Fixes: 721b1d98 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
      Cc: stable@vger.kernel.org # v5.0+
      Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b2155578
    • dm snapshot: introduce account_start_copy() and account_end_copy() · a2f83e8b
      Mikulas Patocka committed
      This simple refactoring moves the code that manipulates the semaphore
      cow_count into separate functions, to prepare for changes that will
      extend these functions into a more sophisticated mechanism for COW
      throttling.
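      
      At this stage the helpers are just thin wrappers around the semaphore,
      roughly:
      
      static void account_start_copy(struct dm_snapshot *s)
      {
              down(&s->cow_count);
      }
      
      static void account_end_copy(struct dm_snapshot *s)
      {
              up(&s->cow_count);
      }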
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a2f83e8b
  4. 09 Oct 2019, 1 commit
  5. 18 Sep 2019, 1 commit
  6. 16 Sep 2019, 1 commit
  7. 14 Sep 2019, 7 commits
    • dm bufio: introduce a global cache replacement · 6e913b28
      Mikulas Patocka committed
      This commit introduces a global cache replacement (instead of per-client
      cleanup).
      
      If one bufio client uses the cache heavily and another client is not using
      it, we want to let the first client use most of the cache. The old
      algorithm would partition the cache equally between the clients, and that
      is sub-optimal.
      
      For cache replacement, we use the clock algorithm because it doesn't
      require taking any lock when the buffer is accessed.
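      
      For reference, a generic clock ("second chance") sketch in plain C - not
      the dm-bufio code itself - showing why an access needs no lock: it only
      sets a reference flag, and only the evictor clears it:
      
      struct cbuf {
              bool referenced;                /* set on access, lock-free */
      };
      
      static inline void cbuf_touch(struct cbuf *b)
      {
              b->referenced = true;           /* the whole access-path cost */
      }
      
      /* called under the eviction lock to pick a victim */
      static struct cbuf *clock_evict(struct cbuf *bufs, size_t n, size_t *hand)
      {
              for (;;) {
                      struct cbuf *b = &bufs[*hand];
      
                      *hand = (*hand + 1) % n;
                      if (b->referenced)
                              b->referenced = false;  /* second chance */
                      else
                              return b;               /* victim */
              }
      }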
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      6e913b28
    • raid5: use bio_end_sector in r5_next_bio · 067df25c
      Guoqing Jiang committed
      We calculate the bio's end sector here, so use the common helper
      bio_end_sector() for that purpose.
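      
      The change is essentially this (sketch of the r5_next_bio() helper;
      bio_end_sector(bio) is bio->bi_iter.bi_sector + bio_sectors(bio)):
      
      static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
      {
              /* was: bio->bi_iter.bi_sector + (bio->bi_iter.bi_size >> 9) */
              if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
                      return bio->bi_next;
              return NULL;
      }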
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      067df25c
    • raid5: remove STRIPE_OPS_REQ_PENDING · feb9bf98
      Guoqing Jiang committed
      This stripe state is not used anymore after commit 51acbcec
      ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsolete
      state.
      
      gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
      drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
      drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      feb9bf98
    • md: add feature flag MD_FEATURE_RAID0_LAYOUT · 33f2c35a
      NeilBrown committed
      Due to a bug introduced in Linux 3.14 we cannot determine the
      correct layout for a multi-zone RAID0 array - there are two
      possibilities.
      
      It is possible to tell the kernel which one to choose using a module
      parameter, but this can be clumsy to use.  It would be best if
      the choice were recorded in the metadata.
      So add a feature flag for this purpose.
      If it is set, then the 'layout' field of the superblock is used
      to determine which layout to use.
      
      If this flag is not set, then mddev->layout gets set to -1,
      which causes the module parameter to be required.
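      
      Conceptually, superblock loading then does something like this (sketch;
      the exact field handling is an assumption):
      
      if (le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)
              mddev->layout = le32_to_cpu(sb->layout);
      else
              mddev->layout = -1;     /* raid0 will demand the module parameter */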
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      33f2c35a
    • md/raid0: avoid RAID0 data corruption due to layout confusion. · c84a1372
      NeilBrown committed
      If the drives in a RAID0 are not all the same size, the array is
      divided into zones.
      The first zone covers all drives, to the size of the smallest.
      The second zone covers all drives larger than the smallest, up to
      the size of the second smallest - etc.
      
      A change in Linux 3.14 unintentionally changed the layout for the
      second and subsequent zones.  All the correct data is still stored, but
      each chunk may be assigned to a different device than in pre-3.14 kernels.
      This can lead to data corruption.
      
      It is not possible to determine which layout to use - it depends on
      which kernel the data was written by.
      So we add a module parameter to allow the old (0) or new (1) layout to be
      specified, and refuse to assemble an affected array if that parameter is
      not set.
      
      Fixes: 20d0189b ("block: Introduce new bio_split()")
      cc: stable@vger.kernel.org (3.14+)
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      c84a1372
    • raid5: don't set STRIPE_HANDLE to stripe which is in batch list · 6ce220dd
      Guoqing Jiang committed
      If a stripe in a batch list is set with the STRIPE_HANDLE flag, then that
      stripe could be set with STRIPE_ACTIVE by the handle_stripe function. And
      if an error happens to the batch_head at the same time,
      break_stripe_batch_list is called, and the warning below can happen (the
      same report as in [1]); it means a member of the batch list was set with
      STRIPE_ACTIVE.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe which is already in a batch list;
      otherwise the stripe could be put on handle_list and run through
      handle_stripe, and the above warning could be triggered.
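      
      The guiding condition is roughly the following (conceptual sketch; in
      raid5 a stripe is a non-head batch member when sh->batch_head is set
      and differs from sh):
      
      /* only a stripe that is not a batch member may be queued for handling */
      if (!sh->batch_head || sh == sh->batch_head)
              set_bit(STRIPE_HANDLE, &sh->state);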
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      6ce220dd
    • raid5: don't increment read_errors on EILSEQ return · b76b4715
      Nigel Croxon committed
      MD counts read errors returned by the lower layer. If those errors
      are -EILSEQ, instead of -EIO, it should NOT increase the
      read_errors count.
      
      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array, even if the
      issue is correctable with a sector re-write and the array has the
      necessary redundancy to correct it.
      
      The leg is ejected because it runs rdev->read_errors up beyond
      conf->max_nr_stripes.  The return status from dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
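      
      In raid5's read-completion path the idea reduces to a sketch like this
      (BLK_STS_PROTECTION being the block-layer status behind -EILSEQ):
      
      /* don't let integrity errors from the lower layer count toward
       * ejecting the device */
      if (bi->bi_status != BLK_STS_PROTECTION)
              atomic_inc(&rdev->read_errors);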
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      b76b4715
  8. 13 Sep 2019, 4 commits
  9. 12 Sep 2019, 2 commits
    • dm: add clone target · 7431b783
      Nikos Tsironis committed
      Add the dm-clone target, which allows cloning of arbitrary block
      devices.
      
      dm-clone produces a one-to-one copy of an existing, read-only source
      device into a writable destination device: It presents a virtual block
      device which makes all data appear immediately, and redirects reads and
      writes accordingly.
      
      The main use case of dm-clone is to clone a potentially remote,
      high-latency, read-only, archival-type block device into a writable,
      fast, primary-type device for fast, low-latency I/O. The cloned device
      is visible/mountable immediately and the copy of the source device to
      the destination device happens in the background, in parallel with user
      I/O.
      
      When the cloning completes, the dm-clone table can be removed altogether
      and be replaced, e.g., by a linear table, mapping directly to the
      destination device.
      
      For further information and examples of how to use dm-clone, please read
      Documentation/admin-guide/device-mapper/dm-clone.rst
      Suggested-by: Vangelis Koukis <vkoukis@arrikto.com>
      Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      7431b783
    • dm raid: fix updating of max_discard_sectors limit · c8156fc7
      Ming Lei committed
      The unit of 'chunk_size' is bytes, not sectors, so fix this by setting
      the queue_limits' max_discard_sectors to rs->md.chunk_sectors.  Also,
      rename chunk_size to chunk_size_bytes.
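      
      A sketch of the corrected raid_io_hints() logic (the io_min/io_opt
      helpers take bytes while max_discard_sectors takes sectors;
      mddev_data_stripes() is assumed from dm-raid):
      
      unsigned int chunk_size_bytes = to_bytes(rs->md.chunk_sectors);
      
      blk_limits_io_min(limits, chunk_size_bytes);
      blk_limits_io_opt(limits, chunk_size_bytes * mddev_data_stripes(rs));
      
      /* sectors, not bytes - this was the bug */
      limits->max_discard_sectors = rs->md.chunk_sectors;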
      
      Without this fix, an overly large max_discard_sectors is applied to the
      request queue of dm-raid, and the raid code eventually has to split the
      bio again.
      
      This re-split done by raid causes the following nested clone_endio:
      
      1) one big bio 'A' is submitted to dm queue, and served as the original
      bio
      
      2) one new bio 'B' is cloned from the original bio 'A', .map()
      is run on 'B', and B's original bio points to 'A'
      
      3) raid code sees that 'B' is too big, so it splits 'B' and re-submits
      the remaining part of 'B' to the dm-raid queue via generic_make_request().
      
      4) now dm will handle 'B' as a new original bio, then allocate a new
      clone bio 'C' and run .map() on 'C'. Meanwhile C's original bio
      points to 'B'.
      
      5) suppose now 'C' is completed by raid directly, then the following
      clone_endio() is called recursively:
      
      	clone_endio(C)
      		->clone_endio(B)		#B is original bio of 'C'
      			->bio_endio(A)
      
      'A' can be big enough to cause hundreds of nested clone_endio() calls,
      so the stack can easily be corrupted.
      
      Fixes: 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
      Cc: stable@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c8156fc7
  10. 06 Sep 2019, 2 commits
    • block: Delay default elevator initialization · 737eb78e
      Damien Le Moal committed
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues, as the block device scan by the device driver is not yet
      complete for most drivers. The device type and elevator required
      features are not set yet, preventing correct selection of the default
      elevator most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
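      
      In sketch form, the resulting API shape:
      
      struct request_queue *
      blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
                                  struct request_queue *q,
                                  bool elevator_init);
      
      /* case 1: regular drivers - elevator picked later from __device_add_disk() */
      q = blk_mq_init_allocated_queue(set, q, false);
      
      /* case 2: request-based DM - limits already known, initialize now */
      q = blk_mq_init_allocated_queue(set, q, true);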
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      737eb78e
    • dm writecache: skip writecache_wait for pmem mode · 6d195913
      Huaisheng Ye committed
      The bio_in_progress[2] counters are only ever increased and decreased
      in SSD mode; for pmem mode they are not involved at all.
      So skip writecache_wait_for_ios in writecache_flush for pmem.
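      
      In sketch form, assuming dm-writecache's WC_MODE_PMEM() mode test:
      
      /* writecache_flush(), sketch: only SSD mode has bios in flight */
      if (!WC_MODE_PMEM(wc))
              writecache_wait_for_ios(wc, WRITE);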
      Suggested-by: Doris Yu <tyu1@lenovo.com>
      Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      6d195913
  11. 04 Sep 2019, 6 commits
    • dm stats: use struct_size() helper · fb16c799
      Gustavo A. R. Silva committed
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct dm_stat {
      	...
              struct dm_stat_shared stat_shared[0];
      };
      
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      So, replace the following form:
      
      sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared)
      
      with:
      
      struct_size(s, stat_shared, n_entries)
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      fb16c799
    • md/raid5: use bio_end_sector to calculate last_sector · b0f01ecf
      Guoqing Jiang committed
      Use the common helper bio_end_sector() to get last_sector.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      b0f01ecf
    • md/raid1: fail run raid1 array when active disk less than one · 07f1a685
      Yufen Yu committed
      When running this test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      mdadm fails to run, with kernel messages as follows:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when the number of active disks in the raid1 array is less
      than one, we need to return failure from raid1_run().
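      
      The fix reduces to a check of this shape in raid1_run() (sketch; the
      exact error value is an assumption):
      
      /* no active mirrors left - refuse to start the array */
      if (conf->raid_disks - mddev->degraded < 1) {
              ret = -EINVAL;
              goto abort;
      }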
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      07f1a685
    • md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone · 62f7b198
      Guilherme G. Piccoli committed
      Currently md raid0/linear are not provided with any mechanism to validate
      if an array member got removed or failed. The driver keeps sending BIOs
      regardless of the state of array members, and kernel shows state 'clean'
      in the 'array_state' sysfs attribute. This leads to the following
      situation: if a raid0/linear array member is removed and the array is
      mounted, some user writing to this array won't realize that errors are
      happening unless they check dmesg or perform one fsync per written file.
      Despite udev signaling the member device is gone, 'mdadm' cannot issue the
      STOP_ARRAY ioctl successfully, given the array is mounted.
      
      In other words, no -EIO is returned and writes (except direct ones) appear
      normal. This means the user might think the written data is correctly
      stored in the array when in fact garbage was written, given that raid0
      does striping (and so requires all its members to be working in order not
      to corrupt data). For md/linear, writes to the available members will work
      fine, but writes that go to the missing member(s) will cause file
      corruption, since the portion of the writes directed at the missing
      devices is never actually written.
      
      This patch changes this behavior: we check if the block device's gendisk
      is UP when submitting the BIO to the array member, and if it isn't, we flag
      the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
      request to the array requiring data from a valid member is still completed.
      While flagging the device as MD_BROKEN, we also show a rate-limited warning
      in the kernel log.
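      
      A sketch of that submission-time check (helper name and the GENHD_FL_UP
      test per the ~v5.4 gendisk API, treated as assumptions):
      
      static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
      {
              if (!(rdev->bdev->bd_disk->flags & GENHD_FL_UP)) {
                      if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
                              pr_warn("md: %s: %s array has a missing/failed member\n",
                                      mdname(rdev->mddev), md_type);
                      return true;
              }
              return false;
      }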
      
      A new array state 'broken' was added too: it mimics the state 'clean' in
      every aspect, being useful only to distinguish if the array has some member
      missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
      state. This state cannot be written to 'array_state': since it just
      indicates that one or more members of the array are missing while the
      array otherwise acts like 'clean', it wouldn't make sense to write it.
      
      With this patch, the filesystem reacts much faster to the event of missing
      array member: after some I/O errors, ext4 for instance aborts the journal
      and prevents corruption. Without this change, we're able to keep writing
      in the disk and after a machine reboot, e2fsck shows some severe fs errors
      that demand fixing. This patch was tested in ext4 and xfs filesystems, and
      requires a 'mdadm' counterpart to handle the 'broken' state.
      
      Cc: Song Liu <songliubraving@fb.com>
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      62f7b198
    • dm crypt: omit parsing of the encapsulated cipher · b1d1e296
      Ard Biesheuvel committed
      Only the ESSIV IV generation mode used to use cc->cipher so it could
      instantiate the bare cipher used to encrypt the IV. However, this is
      now taken care of by the ESSIV template, and so no users of cc->cipher
      remain. So remove it altogether.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      b1d1e296
    • dm crypt: switch to ESSIV crypto API template · a1a262b6
      Ard Biesheuvel committed
      Replace the explicit ESSIV handling in the dm-crypt driver with calls
      into the crypto API, which now possesses the capability to perform
      this processing within the crypto subsystem.
      
      Note that we reorder the AEAD cipher_api string parsing with the TFM
      instantiation: this is needed because cipher_api is mangled by the
      ESSIV handling, and throws off the parsing of "authenc(" otherwise.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a1a262b6
  12. 03 Sep 2019, 3 commits
  13. 28 Aug 2019, 3 commits
    • raid5 improve too many read errors msg by adding limits · 0009fad0
      Nigel Croxon committed
      Limits can often be changed by the admin. When discussing such things
      it helps if you can provide "self-sustained" facts. Also, sometimes
      the admin thinks they changed a limit, but it did not take effect for
      some reason, or they changed the wrong thing.
      
      V3: Only pr_warn when Faulty is 0.
      V2: Add read_errors value to pr_warn.
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      0009fad0
    • md: don't report active array_state until after revalidate_disk() completes. · 9d4b45d6
      NeilBrown committed
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is active
      until that time.
      udev rules check array_state to see if the array is ready.  As soon as
      it appears to be ready, fsck can be run.  If fsck finds the size to be
      zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
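      
      In sketch form, the interlock looks like this (names from the text
      above; error handling abbreviated):
      
      /* do_md_run(), sketch */
      set_bit(MD_NOT_READY, &mddev->flags);
      err = md_run(mddev);
      if (!err) {
              set_capacity(mddev->gendisk, mddev->array_sectors);
              revalidate_disk(mddev->gendisk);        /* size becomes valid */
              clear_bit(MD_NOT_READY, &mddev->flags); /* "active" may be shown */
              sysfs_notify_dirent_safe(mddev->sysfs_state);
      }
      clear_bit(MD_NOT_READY, &mddev->flags);         /* covers the error path */
      
      /* array_state_show(), sketch */
      if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags))
              /* report active/clean as before */ ;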
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensures the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY is cleared twice.  This is deliberate, to cover
      both success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      9d4b45d6
    • md: only call set_in_sync() when it is expected to succeed. · 480523fe
      NeilBrown committed
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call will introduce a delay to the md thread, which can noticeably
      affect IO submission latency.
      
      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero at entry, meaning that there has been no
      recent IO.  So we save this "safemode was nonzero" state, and only
      call set_in_sync() if it was non-zero.
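      
      Sketch of the change in md_check_recovery() (the local variable name is
      illustrative):
      
      bool try_set_sync = mddev->safemode != 0;   /* sampled at entry */
      
      /* ... later, once per pass ... */
      if (try_set_sync && !mddev->external && !mddev->in_sync) {
              spin_lock(&mddev->lock);
              set_in_sync(mddev);
              spin_unlock(&mddev->lock);
      }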
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      480523fe
  14. 27 Aug 2019, 1 commit
  15. 26 Aug 2019, 3 commits
    • dm raid1: use struct_size() with kzalloc() · bcd67654
      Gustavo A. R. Silva committed
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct mirror_set {
      	...
              struct mirror mirror[0];
      };
      
      size = sizeof(struct mirror_set) + count * sizeof(struct mirror);
      instance = kzalloc(size, GFP_KERNEL)
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kzalloc(struct_size(instance, mirror, count), GFP_KERNEL)
      
      Notice that, in this case, variable len is not necessary, hence it
      is removed.
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      bcd67654
    • dm writecache: optimize performance by sorting the blocks for writeback_all · 5229b489
      Huaisheng Ye committed
      During writeback, the blocks that have been placed on wbl.list for
      imminent writeback are only partially ordered: just the contiguous
      runs are in order.
      
      When writeback_all has been set (as it is in most cases, and by default),
      there will be a lot of blocks in pmem that need writeback at the same
      time. For this case, we can optimize performance by sorting all the
      blocks on wbl.list: writecache_writeback doesn't take blocks from the
      tail of wc->lru, but instead takes the first rb_node from the rb_tree.
      
      The benefit is that writecache_writeback incurs no cost to sort the
      blocks, because all blocks are already stored in order in the rb_tree.
      A writecache_flush happens when writeback_all begins to work, which
      eliminates duplicate blocks in the cache via the committed/uncommitted
      handling.
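      
      In sketch form, the selection in writecache_writeback() becomes (field
      names per dm-writecache, treated as assumptions):
      
      struct wc_entry *e;
      
      if (wc->writeback_all) {
              /* rb_tree order is LBA order, so blocks come out pre-sorted */
              e = container_of(rb_first(&wc->tree), struct wc_entry, rb_node);
      } else {
              /* normal path: oldest entry from the LRU tail */
              e = container_of(wc->lru.prev, struct wc_entry, lru);
      }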
      
      Testing platform: ThinkSystem SR630 with persistent memory.
      The cache comes from pmem and is 1006MB in size. The origin device is an
      HDD, of which 2GB is used.
      
      Testing steps:
       1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4  4096 0'
       2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
       3) time dmsetup message /dev/mapper/mycache 0 flush
      
      Here are the results:
      With the patch:
       # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
         iops        : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197
       # time dmsetup message /dev/mapper/mycache 0 flush
      real	0m44.020s
      user	0m0.002s
      sys	0m0.003s
      
      Without the patch:
       # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
         iops        : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211
       # time dmsetup message /dev/mapper/mycache 0 flush
      real	1m39.221s
      user	0m0.001s
      sys	0m0.003s
      
      I have also checked data accuracy with this patch by making an EXT4
      filesystem on mycache, then mounting it and checking the md5 of files on
      it. The test result is positive; with this patch more than half of the
      time is saved when writeback_all is used.
      Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      5229b489
    • dm writecache: add unlikely for getting two block with same LBA · 62421b38
      Huaisheng Ye committed
      In the function writecache_writeback, entries g and f having the same
      original sector only happens when entry f has been committed but
      entry g has NOT yet been.
      
      The probability of this happening is very low within the at most
      256 blocks following entry e.
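      
      The annotation is of this shape (sketch; read_original_sector() per
      dm-writecache, treated as an assumption):
      
      /* duplicate-LBA pairs are rare, so hint the compiler */
      if (unlikely(read_original_sector(wc, g) == read_original_sector(wc, f)))
              break;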
      Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
      Acked-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      62421b38