1. 25 Oct, 2019 1 commit
    • md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · 775d7831
      David Jeffery committed
      If pers->make_request fails in md_flush_request(), the bio is lost. To
      fix this, pass back a bool to indicate if the original make_request call
      should continue to handle the I/O instead of assuming the flush logic
      will push it to completion.
      
      Convert md_flush_request to return a bool and no longer call the raid
      driver's make_request function.  If the return is true, then the md flush
      logic has or will complete the bio and the md make_request call is done.
      If false, then the md make_request function needs to keep processing like
      it is a normal bio. Let the original call to md_handle_request handle any
      need to retry sending the bio to the raid driver's make_request function
      should it be needed.
      
      Also mark md_flush_request and the make_request function pointer as
      __must_check to issue warnings should these critical return values be
      ignored.
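
The calling convention described above can be sketched in plain userspace C. This is a minimal sketch under stated assumptions, not the kernel code: `struct sketch_bio`, `sketch_flush_request()` and `sketch_make_request()` are hypothetical stand-ins, and `warn_unused_result` is the GCC/Clang attribute behind the kernel's `__must_check`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; not the real kernel structure. */
struct sketch_bio {
	bool preflush;   /* models REQ_PREFLUSH being set */
	size_t size;     /* payload bytes; 0 models an empty flush */
	bool completed;  /* set once somebody has finished the bio */
};

/*
 * Returns true when the flush logic has completed (or will complete) the
 * bio, false when the caller must keep processing it as a normal bio.
 * The attribute makes the compiler warn if the result is ignored.
 */
__attribute__((warn_unused_result))
static bool sketch_flush_request(struct sketch_bio *bio)
{
	if (bio->size == 0) {
		bio->completed = true;  /* pure flush: nothing left to submit */
		return true;
	}
	bio->preflush = false;          /* flush handled; data still pending */
	return false;
}

/* Caller side: returns true if the bio went down the normal I/O path. */
static bool sketch_make_request(struct sketch_bio *bio)
{
	if (bio->preflush && sketch_flush_request(bio))
		return false;           /* flush logic owns the bio now */
	bio->completed = true;          /* assumed normal path: handle the data */
	return true;
}
```

The point of the design is that the caller never has to guess who owns the bio: the boolean answers that question, and the attribute makes it hard to drop the answer.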
      
      Fixes: 2bc13b83 ("md: batch flush requests.")
      Cc: stable@vger.kernel.org # v4.19+
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  2. 14 Sep, 2019 3 commits
    • raid5: remove STRIPE_OPS_REQ_PENDING · feb9bf98
      Guoqing Jiang committed
      This stripe state is not used anymore after commit 51acbcec
      ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsolete
      state.
      
      gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r
      drivers/md/raid5.c:					  (1 << STRIPE_OPS_REQ_PENDING) |
      drivers/md/raid5.h:	STRIPE_OPS_REQ_PENDING,
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: don't set STRIPE_HANDLE to stripe which is in batch list · 6ce220dd
      Guoqing Jiang committed
      If a stripe in a batch list has the STRIPE_HANDLE flag set, the
      handle_stripe function can set STRIPE_ACTIVE on it. If an error hits
      the batch_head at the same time, break_stripe_batch_list is called and
      the warning below can trigger (the same report as in [1]). It means a
      member of the batch list had STRIPE_ACTIVE set.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe that is already in a batch list.
      Otherwise the stripe could be put on handle_list and run through
      handle_stripe, and the above warning could be triggered.
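
The rule stated above (only the head of a batch gets scheduled for handling) can be sketched as a small guard. This is an illustrative sketch, not the actual patch: `struct sketch_stripe`, `SKETCH_STRIPE_HANDLE` and `sketch_request_handling()` are hypothetical names, and the real driver uses atomic bit operations on its own stripe structure.

```c
#include <stddef.h>

/* Hypothetical flag bit standing in for STRIPE_HANDLE. */
enum { SKETCH_STRIPE_HANDLE = 1u << 0 };

/* Hypothetical, heavily reduced stripe: only what the guard needs. */
struct sketch_stripe {
	unsigned long state;
	struct sketch_stripe *batch_head;  /* NULL when not batched */
};

/*
 * Ask for the stripe to be handled, unless it is a non-head member of a
 * batch. In that case the batch head is the one that gets handled.
 */
static void sketch_request_handling(struct sketch_stripe *sh)
{
	if (sh->batch_head != NULL && sh->batch_head != sh)
		return;                     /* batch member: leave it alone */
	sh->state |= SKETCH_STRIPE_HANDLE;  /* standalone stripe or batch head */
}
```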
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html

      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: don't increment read_errors on EILSEQ return · b76b4715
      Nigel Croxon committed
      MD counts read errors returned by the lower layer. If those errors
      are -EILSEQ instead of -EIO, it should NOT increase the read_errors
      count.

      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array, even if the
      issue is correctable with a sector re-write and the array has the
      necessary redundancy to correct it.

      The leg is ejected because it runs up rdev->read_errors beyond
      conf->max_nr_stripes.  The return status in dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
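
The accounting rule described above can be sketched as follows. This is a minimal userspace sketch, not the driver code: `struct sketch_rdev` and `sketch_note_read_error()` are hypothetical, and the limit is passed in as a plain parameter where the text names conf->max_nr_stripes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device error counter. */
struct sketch_rdev {
	int read_errors;
};

/*
 * Record one failed read. Integrity errors (-EILSEQ) are not counted,
 * because a re-write from redundancy can correct them. Returns true when
 * the counter has gone past the limit and the device would be ejected.
 */
static bool sketch_note_read_error(struct sketch_rdev *rdev, int err,
				   int max_read_errors)
{
	if (err != -EILSEQ)
		rdev->read_errors++;
	return rdev->read_errors > max_read_errors;
}
```

With this rule, any number of -EILSEQ completions leaves the counter at zero, while -EIO completions still accumulate toward ejection.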
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  3. 04 Sep, 2019 1 commit
  4. 28 Aug, 2019 1 commit
  5. 08 Aug, 2019 1 commit
    • md/raid6: Set R5_ReadError when there is read failure on parity disk · 143f6e73
      Xiao Ni committed
      7471fb77 ("md/raid6: Fix anomily when recovering a single device in
      RAID6.") avoids rereading P when it can be computed from other members.
      However, this misses the chance to re-write the right data to P. This
      patch sets R5_ReadError if the re-read fails.
      
      Also, when the re-read is skipped, we miss the chance to reset
      rdev->read_errors to 0. This can fail the disk when there are many
      read errors on the P member disk (while other disks have no read
      errors).

      V2: upper-layer read requests don't read parity/Q data, so there is
      no need to consider that situation.

      Reported-by: kbuild test robot <lkp@intel.com>
      
      Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.")
      Cc: <stable@vger.kernel.org> #4.4+
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  6. 21 Jun, 2019 1 commit
    • block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Christoph Hellwig committed
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and return it from a variant of
      blk_queue_split instead so that it can be passed as an argument to
      those functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this hot path down, remove
      pointless struct request_queue arguments from any of the functions
      that had it and grew a nr_segs argument.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 15 Jun, 2019 1 commit
  8. 24 May, 2019 1 commit
  9. 17 Apr, 2019 2 commits
  10. 11 Apr, 2019 2 commits
  11. 02 Apr, 2019 1 commit
    • Don't jump to compute_result state from check_result state · 4f4fd7c5
      Nigel Croxon committed
      Changing state from check_state_check_result to
      check_state_compute_result not only is unsafe but also doesn't
      appear to serve a valid purpose.  A raid6 check should only be
      pushing out extra writes if doing repair and a mis-match occurs.
      The stripe dev management will already try and do repair writes
      for failing sectors.
      
      This patch makes the raid6 check_state_check_result handling
      work more like raid5's.  If there are somehow too many failures for
      a check, just quit the check operation for the stripe.  When any
      checks pass, don't try and use check_state_compute_result for
      a purpose it isn't needed for and is unsafe for.  Just mark the
      stripe as in sync for passing its parity checks and let the
      stripe dev read/write code and the bad blocks list do their
      job handling I/O errors.
      
      Repro steps from Xiao:
      
      These are the steps to reproduce this problem:
      1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c
      2. insmod scsi_debug.ko dev_size_mb=11000  max_luns=1 num_tgts=1
      3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
      sde is the disk created by scsi_debug
      4. echo "2" >/sys/module/scsi_debug/parameters/opts
      5. raid-check
      
      It panics:
      [ 4854.730899] md: data-check of RAID array md127
      [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current]
      [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error
      [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00
      [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0
      [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current]
      [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error
      [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00
      [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000
      [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current]
      [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error
      [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00
      [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000
      [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
      [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current]
      [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error
      [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00
      [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0
      [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1).
      [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1).
      [ 4854.906201] ------------[ cut here ]------------
      [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190!
      
      raid5.c:4190 above is this BUG_ON:
      
          handle_parity_checks6()
              ...
              BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
      
      Cc: <stable@vger.kernel.org> # v3.16+
      OriginalAuthor: David Jeffery <djeffery@redhat.com>
      Cc: Xiao Ni <xni@redhat.com>
      Tested-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 13 Mar, 2019 3 commits
  13. 29 Jan, 2019 1 commit
    • md/raid5: fix 'out of memory' during raid cache recovery · 483cbbed
      Alexei Naberezhnov committed
      This fixes the case when md array assembly fails because raid cache recovery
      is unable to allocate a stripe, despite attempts to replay stripes and increase
      cache size. This happens because stripes released by r5c_recovery_replay_stripes
      and raid5_set_cache_size don't become available for allocation immediately.
      Released stripes first are placed on conf->released_stripes list and require
      md thread to merge them on conf->inactive_list before they can be allocated.
      
      The patch allows the final allocation attempt during cache recovery to wait
      for new stripes to become available for allocation.
      
      Cc: linux-raid@vger.kernel.org
      Cc: Shaohua Li <shli@kernel.org>
      Cc: linux-stable <stable@vger.kernel.org> # 4.10+
      Fixes: b4c625c6 ("md/r5cache: r5cache recovery: part 1")
      Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  14. 29 Sep, 2018 1 commit
    • raid5: block failing device if raid will be failed · fb73b357
      Mariusz Tkaczyk committed
      Currently there is an inconsistency for failing the member drives
      for arrays with different RAID levels. For RAID456 - there is a possibility
      to fail all of the devices. However - for other RAID levels - kernel blocks
      removing the member drive, if the operation results in array's FAIL state
      (EBUSY is returned). For example - removing last drive from RAID1 is not
      possible.
      This kind of blocker was never implemented for raid456 and we cannot see
      the reason why.
      
      We have tested the following patch and did not observe any regression. Do you
      have any comments on, or reasons for, the current approach, or can we send the
      proper patch for this?
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  15. 01 Sep, 2018 1 commit
  16. 03 Aug, 2018 1 commit
    • md/raid5: fix data corruption of replacements after originals dropped · d63e2fc8
      BingJing Chang committed
      During raid5 replacement, the stripes can be marked with R5_NeedReplace
      flag. Data can be read from being-replaced devices and written to
      replacing spares without reading all other devices. (It's 'replace'
      mode. s.replacing = 1) If a being-replaced device is dropped, the
      replacement progress will be interrupted and resumed with pure recovery
      mode. However, existing stripes before being interrupted cannot read
      from the dropped device anymore. It prints lots of WARN_ON messages.
      And it results in data corruption because existing stripes write
      problematic data into its replacement device and update the progress.
      
      # Erase disks (1MB + 2GB)
      dd if=/dev/zero of=/dev/sda bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
      dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
      mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
      # Ensure array stores non-zero data
      dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
      # Start replacement
      mdadm /dev/md0 -a /dev/sdd
      mdadm /dev/md0 --replace /dev/sda
      
      Then hot-plug out /dev/sda during recovery, and wait for the recovery to finish.
      echo check > /sys/block/md0/md/sync_action
      cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.
      
      Soon after you hot-plug out /dev/sda, you will see many WARN_ON
      messages. The replacement recovery will be interrupted shortly. After
      the recovery finishes, it will result in data corruption.
      
      Actually, it's just an unhandled case of replacement. In commit
      <f94c0b66> (md/raid5: fix interaction of 'replace' and 'recovery'.),
      a NeedReplace device that is not UPTODATE is treated as an error, but
      the commit simply prints a WARN_ON and still marks these corrupted
      stripes with R5_WantReplace (meaning they are ready for writes).

      To fix this case, we can leverage the 'sync and replace' mode mentioned
      in commit <9a3e1101> (md/raid5: detect and handle replacements during
      recovery.). We can add logic to detect and use 'sync and replace' mode
      for these stripes.
      Reported-by: Alex Chen <alexchen@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  17. 02 Aug, 2018 1 commit
  18. 24 Jul, 2018 1 commit
  19. 19 Jul, 2018 1 commit
  20. 13 Jun, 2018 1 commit
    • treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook committed
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
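
The reason for preferring the 2-factor form is that the allocator can check the multiplication for overflow instead of silently wrapping. The sketch below is a userspace analogue under stated assumptions: `sketch_calloc2()` and `sketch_array3_size()` are hypothetical names that model kcalloc() and array3_size(), and saturating to SIZE_MAX is assumed here as the way an overflowing size is made to fail the allocation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Analogue of the 2-factor form: refuse instead of wrapping around. */
static void *sketch_calloc2(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;            /* n * size would overflow */
	return calloc(n, size);         /* zeroed memory, like kzalloc() */
}

/* Analogue of the 3-factor helper: saturate so the allocation fails. */
static size_t sketch_array3_size(size_t a, size_t b, size_t c)
{
	if (b != 0 && a > SIZE_MAX / b)
		return SIZE_MAX;        /* a * b overflows */
	if (c != 0 && a * b > SIZE_MAX / c)
		return SIZE_MAX;        /* (a * b) * c overflows */
	return a * b * c;
}
```

An open-coded `a * b` passed to a single-size allocator can wrap to a small number and return a buffer that is too short. Both helpers above turn that case into an allocation failure.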
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  21. 31 May, 2018 1 commit
  22. 18 May, 2018 2 commits
  23. 09 Mar, 2018 1 commit
  24. 26 Feb, 2018 1 commit
    • md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      BingJing Chang committed
      There is a potential deadlock if mount/umount happens when
      raid5_finish_reshape() tries to grow the size of emulated disk.
      
      How does the deadlock happen?
      1) The raid5 resync thread finished reshape (expanding array).
      2) The mount or umount thread holds VFS sb->s_umount lock and tries to
         write through critical data into raid5 emulated block device. So it
         waits for raid5 kernel thread handling stripes in order to finish its
         I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
      The raid5 kernel thread is an emulated block device. It is responsible for
      handling I/Os (stripes) from upper layers. The emulated block device
      should not request any I/Os on itself. That is, it should not call VFS
      layer functions. (If it did, it will try to acquire VFS locks to
      guarantee the I/Os sequence.) So we have the resync thread to send
      resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
      reproducible and can be fixed by this patch. For raid10.c, we can remove
      the similar code to prevent deadlock as well since it has been called
      before.
      Reported-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Alex Wu <alexwu@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  25. 22 Feb, 2018 1 commit
    • md: raid5: avoid string overflow warning · 53b8d89d
      Arnd Bergmann committed
      gcc warns about a possible overflow of the kmem_cache string, when adding
      four characters to a string of the same length:
      
      drivers/md/raid5.c: In function 'setup_conf':
      drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=]
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
                                        ^~~~
      drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32
        sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      If I'm counting correctly, we need 11 characters for the fixed part
      of the string and 18 characters for a 64-bit pointer (when no gendisk
      is used), so that leaves three characters for conf->level, which should
      always be sufficient.
      
      This makes the code use snprintf() with the correct length, to
      make the code more robust against changes, and to get the compiler
      to shut up.
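
A bounded version of the name construction might look like the sketch below. It is an illustration of the technique, not necessarily the exact change: `SKETCH_NAME_LEN` and `sketch_alt_name()` are hypothetical, and the `%.27s` precision is chosen here so that the prefix, the four characters of "-alt" and the terminating NUL fit in 32 bytes.

```c
#include <stdio.h>

/* Hypothetical buffer size, matching the 32-byte destination in the warning. */
#define SKETCH_NAME_LEN 32

/*
 * Build "<base>-alt" without ever writing past dst. The precision caps the
 * copied prefix at 27 characters: 27 + 4 ("-alt") + 1 (NUL) = 32.
 * Returns what snprintf() returns: the length of the formatted string.
 */
static int sketch_alt_name(char dst[SKETCH_NAME_LEN], const char *base)
{
	return snprintf(dst, SKETCH_NAME_LEN, "%.27s-alt", base);
}
```

Because the precision bounds the variable part, the compiler can prove the output fits, and an over-long base name is truncated rather than overflowing the buffer.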
      
      In commit f4be6b43 ("md/raid5: ensure we create a unique name for
      kmem_cache when mddev has no gendisk") from 2010, Neil said that
      the pointer could be removed "shortly" once devices without gendisk
      are disallowed. I have no idea if that happened, but if it did, that
      should probably be changed as well.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  26. 18 Feb, 2018 1 commit
  27. 16 Jan, 2018 1 commit
    • raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Tomasz Majchrzak committed
      In order to provide data consistency with PPL for disks with write-back
      cache enabled all data has to be flushed to disks before next PPL
      entry. The disks to be flushed are marked in the bitmap. It's modified
      under a mutex and it's only read after PPL io unit is submitted.
      
      A limitation of 64 disks in the array has been introduced to keep data
      structures and implementation simple. RAID5 arrays with so many disks are
      not likely due to the high risk of multiple disk failures. Such a restriction
      should not be a real-life limitation.
      
      With write-back cache disabled next PPL entry is submitted when data write
      for current one completes. Data flush defers next log submission so trigger
      it when there are no stripes for handling found.
      
      As PPL assures all data is flushed to disk at request completion, just
      acknowledge flush request when PPL is enabled.
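
The 64-disk limit follows naturally from keeping the "disks to flush" set in a single 64-bit word. The sketch below illustrates that idea only: `sketch_mark_for_flush()` and `sketch_take_flush_set()` are hypothetical names, and the mutex that the text says protects the bitmap is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per member disk, hence the limit of 64 disks. */
#define SKETCH_MAX_DISKS 64

/* Mark a disk as needing a flush; rejects indexes the bitmap cannot hold. */
static bool sketch_mark_for_flush(uint64_t *bitmap, unsigned int disk)
{
	if (disk >= SKETCH_MAX_DISKS)
		return false;
	*bitmap |= UINT64_C(1) << disk;
	return true;
}

/*
 * Collect the marked disks into disks[] (in ascending order), clear the
 * bitmap for the next log entry, and return how many disks were marked.
 */
static unsigned int sketch_take_flush_set(uint64_t *bitmap,
					  unsigned int disks[SKETCH_MAX_DISKS])
{
	unsigned int n = 0;

	for (unsigned int d = 0; d < SKETCH_MAX_DISKS; d++)
		if (*bitmap & (UINT64_C(1) << d))
			disks[n++] = d;
	*bitmap = 0;
	return n;
}
```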
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  28. 12 Dec, 2017 1 commit
    • md: introduce new personality function start() · d5d885fd
      Song Liu committed
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
      design doesn't break badly right now, but it could lead to a bad bug in
      the future.
      
      This patch tries to resolve this problem by splitting start up work
      into two personality functions, run() and start(). Tasks that do not
      require the md threads should go into run(), while tasks that require
      the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d5d885fd
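The run()/start() split described in this commit can be illustrated with a small userspace model of the ordering: run() first, then the thread becomes available, then start(). This is only a sketch under stated assumptions. Every `model_*` name is hypothetical, and the real do_md_run(), md_run() and raid5_start() do considerably more than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of an array's start-up state. */
struct model_array {
	bool thread_running;
	bool log_loaded;
};

struct model_personality {
	int (*run)(struct model_array *a);   /* must not need the md thread */
	int (*start)(struct model_array *a); /* optional; may use the thread */
};

static int model_raid5_run(struct model_array *a)
{
	(void)a; /* set up data structures only; no thread wakeups here */
	return 0;
}

static int model_raid5_start(struct model_array *a)
{
	if (!a->thread_running)
		return -1; /* loading the log relies on a running thread */
	a->log_loaded = true;
	return 0;
}

/* Ordering from the commit text: run(), enable threads, then start(). */
static int model_do_run(struct model_array *a,
			const struct model_personality *p)
{
	int err;

	assert(p != NULL && p->run != NULL);
	err = p->run(a);
	if (err)
		return err;
	a->thread_running = true;
	return p->start ? p->start(a) : 0;
}
```

Keeping start() optional in the model reflects that only personalities with thread-dependent start-up work need it.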
  29. 02 Dec, 2017 1 commit
  30. 09 Nov, 2017 1 commit
    • N
      md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      NeilBrown committed
      The ->recovery_offset shows how much of a non-InSync device is actually
      in sync - how much has been recovered.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
      ->curr_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
      So change the common code in md.c to only use ->curr_resync_completed
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      db0505d3
  31. 02 Nov, 2017 3 commits
    • M
      md: use TASK_IDLE instead of blocking signals · ae89fd3d
      Mikulas Patocka committed
      Hi - I submit this patch for the next merge window:
      
      Some time ago, I made a patch f9c79bc0 that blocks signals around the
      schedule() calls in MD. The MD subsystem needs to do an uninterruptible
      sleep that is not accounted in load average - so we block signals and use
      interruptible sleep.
      
      The kernel has a special TASK_IDLE state for this purpose, so we can use
      it instead of blocking signals. This patch doesn't fix any bug, it just
      makes the code simpler.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      ae89fd3d
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown committed
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      There are still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: Shaohua Li <shli@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b03e0ccb
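The replacement of ->quiesce(.., 2) with a quiesce/resume pair can be pictured with a toy model in which resuming is what wakes the waiters. This is a rough sketch, not the md code: `model_conf`, `model_quiesce` and `model_wake_waiters` are hypothetical names, and the wakeup counter merely stands in for waking real wait queues.

```c
#include <assert.h>

/* Hypothetical model: 'wakeups' counts how often waiters were woken. */
struct model_conf {
	int quiesced;
	unsigned int wakeups;
};

/* state 1: pause IO; state 0: resume IO and wake anything waiting. */
static void model_quiesce(struct model_conf *c, int state)
{
	assert(state == 0 || state == 1);
	if (state == 1) {
		c->quiesced = 1;
	} else {
		c->quiesced = 0;
		c->wakeups++;
	}
}

/* What a former quiesce(.., 2) caller would do instead in this model. */
static void model_wake_waiters(struct model_conf *c)
{
	model_quiesce(c, 1); /* IO pauses briefly: the small cost noted above */
	model_quiesce(c, 0);
}
```

The model ends in the same state as before the call, with waiters woken once, which is the sense in which the pair achieves the same result.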
    • N
      md: move suspend_hi/lo handling into core md code · b3143b9a
      NeilBrown committed
      Responding to ->suspend_lo and ->suspend_hi is similar
      to responding to ->suspended.  It is best to wait in
      the common core code without incrementing ->active_io.
      This allows mddev_suspend()/mddev_resume() to work while
      requests are waiting for suspend_lo/hi to change.
      This will be important after a subsequent patch
      which uses mddev_suspend() to synchronize updating for
      suspend_lo/hi.
      
      So move the code for testing suspend_lo/hi out of raid1.c
      and raid5.c, and place it in md.c
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b3143b9a